"tests/models/test_blip2.py" did not exist on "ecb33a28cb6c10ebf3b1aa139f72e759cacb8c15"
Commit 711aa9d5 authored by zhuwenwen's avatar zhuwenwen
Browse files

Merge tag 'v0.10.0' into v0.10.0-dev

parents 751c492c 6d8d0a24
---
title: Disaggregated Prefilling (experimental)
---
[](){ #disagg-prefill }
# Disaggregated Prefilling (experimental)
This page introduces you the disaggregated prefilling feature in vLLM.
......
---
title: LoRA Adapters
---
[](){ #lora-adapter }
# LoRA Adapters
This document shows you how to use [LoRA adapters](https://arxiv.org/abs/2106.09685) with vLLM on top of a base model.
......@@ -29,7 +26,7 @@ We can now submit the prompts and call `llm.generate` with the `lora_request` pa
of `LoRARequest` is a human identifiable name, the second parameter is a globally unique ID for the adapter and
the third parameter is the path to the LoRA adapter.
??? Code
??? code
```python
sampling_params = SamplingParams(
......@@ -70,7 +67,7 @@ The server entrypoint accepts all other LoRA configuration parameters (`max_lora
etc.), which will apply to all forthcoming requests. Upon querying the `/models` endpoint, we should see our LoRA along
with its base model (if `jq` is not installed, you can follow [this guide](https://jqlang.org/download/) to install it.):
??? Command
??? console "Command"
```bash
curl localhost:8000/v1/models | jq .
......@@ -172,7 +169,7 @@ Alternatively, follow these example steps to implement your own plugin:
1. Implement the LoRAResolver interface.
??? Example of a simple S3 LoRAResolver implementation
??? code "Example of a simple S3 LoRAResolver implementation"
```python
import os
......@@ -238,7 +235,7 @@ The new format of `--lora-modules` is mainly to support the display of parent mo
- The `parent` field of LoRA model `sql-lora` now links to its base model `meta-llama/Llama-2-7b-hf`. This correctly reflects the hierarchical relationship between the base model and the LoRA adapter.
- The `root` field points to the artifact location of the lora adapter.
??? Command output
??? console "Command output"
```bash
$ curl http://localhost:8000/v1/models
......@@ -275,3 +272,80 @@ The new format of `--lora-modules` is mainly to support the display of parent mo
]
}
```
## Default LoRA Models For Multimodal Models
Some models, e.g., [Granite Speech](https://huggingface.co/ibm-granite/granite-speech-3.3-8b) and [Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) multimodal, contain LoRA adapter(s) that are expected to always be applied when a given modality is present. This can be a bit tedious to manage with the above approaches, as it requires the user to send the `LoRARequest` (offline) or to filter requests between the base model and LoRA model (server) depending on the content of the request's multimodal data.
To this end, we allow registration of default multimodal LoRAs to handle this automatically, where users can map each modality to a LoRA adapter to automatically apply it when the corresponding inputs are present. Note that currently, we only allow one LoRA per prompt; if several modalities are provided, each of which are registered to a given modality, none of them will be applied.
??? code "Example usage for offline inference"
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset
model_id = "ibm-granite/granite-speech-3.3-2b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
def get_prompt(question: str, has_audio: bool):
"""Build the input prompt to send to vLLM."""
if has_audio:
question = f"<|audio|>{question}"
chat = [
{
"role": "user",
"content": question
}
]
return tokenizer.apply_chat_template(chat, tokenize=False)
llm = LLM(
model=model_id,
enable_lora=True,
max_lora_rank=64,
max_model_len=2048,
limit_mm_per_prompt={"audio": 1},
# Will always pass a `LoRARequest` with the `model_id`
# whenever audio is contained in the request data.
default_mm_loras = {"audio": model_id},
enforce_eager=True,
)
question = "can you transcribe the speech into a written format?"
prompt_with_audio = get_prompt(
question=question,
has_audio=True,
)
audio = AudioAsset("mary_had_lamb").audio_and_sample_rate
inputs = {
"prompt": prompt_with_audio,
"multi_modal_data": {
"audio": audio,
}
}
outputs = llm.generate(
inputs,
sampling_params=SamplingParams(
temperature=0.2,
max_tokens=64,
),
)
```
You can also pass a json dictionary of `--default-mm-loras` mapping modalities to LoRA model IDs. For example, when starting the server:
```bash
vllm serve ibm-granite/granite-speech-3.3-2b \
--max-model-len 2048 \
--enable-lora \
--default-mm-loras '{"audio":"ibm-granite/granite-speech-3.3-2b"}' \
--max-lora-rank 64
```
Note: Default multimodal LoRAs are currently only available for `.generate` and chat completions.
---
title: Multimodal Inputs
---
[](){ #multimodal-inputs }
# Multimodal Inputs
This page teaches you how to pass multi-modal inputs to [multi-modal models][supported-mm-models] in vLLM.
......@@ -20,7 +17,7 @@ To input multi-modal data, follow this schema in [vllm.inputs.PromptType][]:
You can pass a single image to the `'image'` field of the multi-modal dictionary, as shown in the following examples:
??? Code
??? code
```python
from vllm import LLM
......@@ -68,7 +65,7 @@ Full example: <gh-file:examples/offline_inference/vision_language.py>
To substitute multiple images inside the same text prompt, you can pass in a list of images instead:
??? Code
??? code
```python
from vllm import LLM
......@@ -101,7 +98,7 @@ To substitute multiple images inside the same text prompt, you can pass in a lis
Full example: <gh-file:examples/offline_inference/vision_language_multi_image.py>
If using the [LLM.chat](https://docs.vllm.ai/en/stable/models/generative_models.html#llmchat) method, you can pass images directly in the message content using various formats: image URLs, PIL Image objects, or pre-computed embeddings:
If using the [LLM.chat](../models/generative_models.md#llmchat) method, you can pass images directly in the message content using various formats: image URLs, PIL Image objects, or pre-computed embeddings:
```python
from vllm import LLM
......@@ -146,7 +143,7 @@ for o in outputs:
Multi-image input can be extended to perform video captioning. We show this with [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) as it supports videos:
??? Code
??? code
```python
from vllm import LLM
......@@ -193,7 +190,7 @@ Full example: <gh-file:examples/offline_inference/audio_language.py>
To input pre-computed embeddings belonging to a data type (i.e. image, video, or audio) directly to the language model,
pass a tensor of shape `(num_items, feature_size, hidden_size of LM)` to the corresponding field of the multi-modal dictionary.
??? Code
??? code
```python
from vllm import LLM
......@@ -220,7 +217,7 @@ pass a tensor of shape `(num_items, feature_size, hidden_size of LM)` to the cor
For Qwen2-VL and MiniCPM-V, we accept additional parameters alongside the embeddings:
??? Code
??? code
```python
# Construct the prompt based on your model
......@@ -288,7 +285,7 @@ vllm serve microsoft/Phi-3.5-vision-instruct --task generate \
Then, you can use the OpenAI client as follows:
??? Code
??? code
```python
from openai import OpenAI
......@@ -366,7 +363,7 @@ vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --task generate --max-model
Then, you can use the OpenAI client as follows:
??? Code
??? code
```python
from openai import OpenAI
......@@ -430,7 +427,7 @@ vllm serve fixie-ai/ultravox-v0_5-llama-3_2-1b
Then, you can use the OpenAI client as follows:
??? Code
??? code
```python
import base64
......@@ -486,7 +483,7 @@ Then, you can use the OpenAI client as follows:
Alternatively, you can pass `audio_url`, which is the audio counterpart of `image_url` for image input:
??? Code
??? code
```python
chat_completion_from_url = client.chat.completions.create(
......@@ -531,7 +528,7 @@ pass a tensor of shape to the corresponding field of the multi-modal dictionary.
For image embeddings, you can pass the base64-encoded tensor to the `image_embeds` field.
The following example demonstrates how to pass image embeddings to the OpenAI server:
??? Code
??? code
```python
image_embedding = torch.load(...)
......
---
title: Quantization
---
[](){ #quantization-index }
# Quantization
Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices.
......@@ -13,6 +10,7 @@ Contents:
- [BitBLAS](bitblas.md)
- [GGUF](gguf.md)
- [GPTQModel](gptqmodel.md)
- [INC](inc.md)
- [INT4 W4A16](int4.md)
- [INT8 W8A8](int8.md)
- [FP8 W8A8](fp8.md)
......
---
title: AutoAWQ
---
[](){ #auto-awq }
# AutoAWQ
To create a new 4-bit quantized model, you can leverage [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
Quantization reduces the model's precision from BF16/FP16 to INT4 which effectively reduces the total model memory footprint.
......@@ -15,7 +12,7 @@ pip install autoawq
After installing AutoAWQ, you are ready to quantize a model. Please refer to the [AutoAWQ documentation](https://casper-hansen.github.io/AutoAWQ/examples/#basic-quantization) for further details. Here is an example of how to quantize `mistralai/Mistral-7B-Instruct-v0.2`:
??? Code
??? code
```python
from awq import AutoAWQForCausalLM
......@@ -51,7 +48,7 @@ python examples/offline_inference/llm_engine_example.py \
AWQ models are also supported directly through the LLM entrypoint:
??? Code
??? code
```python
from vllm import LLM, SamplingParams
......
---
title: BitBLAS
---
[](){ #bitblas }
# BitBLAS
vLLM now supports [BitBLAS](https://github.com/microsoft/BitBLAS) for more efficient and flexible model inference. Compared to other quantization frameworks, BitBLAS provides more precision combinations.
!!! note
Ensure your hardware supports the selected `dtype` (`torch.bfloat16` or `torch.float16`).
Most recent NVIDIA GPUs support `float16`, while `bfloat16` is more common on newer architectures like Ampere or Hopper.
For details see [supported hardware](https://docs.vllm.ai/en/latest/features/quantization/supported_hardware.html).
For details see [supported hardware](supported_hardware.md).
Below are the steps to utilize BitBLAS with vLLM.
......@@ -43,7 +40,7 @@ llm = LLM(
## Read gptq format checkpoint
??? Code
??? code
```python
from vllm import LLM
......
---
title: BitsAndBytes
---
[](){ #bits-and-bytes }
# BitsAndBytes
vLLM now supports [BitsAndBytes](https://github.com/TimDettmers/bitsandbytes) for more efficient model inference.
BitsAndBytes quantizes models to reduce memory usage and enhance performance without significantly sacrificing accuracy.
......
---
title: FP8 W8A8
---
[](){ #fp8 }
# FP8 W8A8
vLLM supports FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on GPUs such as Nvidia H100 and AMD MI300x.
Currently, only Hopper and Ada Lovelace GPUs are officially supported for W8A8.
......@@ -58,7 +55,7 @@ For FP8 quantization, we can recover accuracy with simple RTN quantization. We r
Since simple RTN does not require data for weight quantization and the activations are quantized dynamically, we do not need any calibration data for this quantization flow.
??? Code
??? code
```python
from llmcompressor.transformers import oneshot
......@@ -89,8 +86,9 @@ Load and run the model in `vllm`:
```python
from vllm import LLM
model = LLM("./Meta-Llama-3-8B-Instruct-FP8-Dynamic")
result = model.generate("Hello my name is")
llm = LLM("./Meta-Llama-3-8B-Instruct-FP8-Dynamic")
result = llm.generate("Hello my name is")
print(result[0].outputs[0].text)
```
......@@ -128,9 +126,10 @@ In this mode, all Linear modules (except for the final `lm_head`) have their wei
```python
from vllm import LLM
model = LLM("facebook/opt-125m", quantization="fp8")
llm = LLM("facebook/opt-125m", quantization="fp8")
# INFO 06-10 17:55:42 model_runner.py:157] Loading model weights took 0.1550 GB
result = model.generate("Hello, my name is")
result = llm.generate("Hello, my name is")
print(result[0].outputs[0].text)
```
......
---
title: GGUF
---
[](){ #gguf }
# GGUF
!!! warning
Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment, it might be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team.
......@@ -41,7 +38,7 @@ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
You can also use the GGUF model directly through the LLM entrypoint:
??? Code
??? code
```python
from vllm import LLM, SamplingParams
......
---
title: GPTQModel
---
[](){ #gptqmodel }
# GPTQModel
To create a new 4-bit or 8-bit GPTQ quantized model, you can leverage [GPTQModel](https://github.com/ModelCloud/GPTQModel) from ModelCloud.AI.
......@@ -31,7 +28,7 @@ After installing GPTQModel, you are ready to quantize a model. Please refer to t
Here is an example of how to quantize `meta-llama/Llama-3.2-1B-Instruct`:
??? Code
??? code
```python
from datasets import load_dataset
......@@ -69,7 +66,7 @@ python examples/offline_inference/llm_engine_example.py \
GPTQModel quantized models are also supported directly through the LLM entrypoint:
??? Code
??? code
```python
from vllm import LLM, SamplingParams
......
---
title: FP8 INC
---
[](){ #inc }
vLLM supports FP8 (8-bit floating point) weight and activation quantization using Intel® Neural Compressor (INC) on Intel® Gaudi® 2 and Intel® Gaudi® 3 AI accelerators.
Currently, quantization is validated only in Llama models.
Intel Gaudi supports quantization of various modules and functions, including, but not limited to `Linear`, `KVCache`, `Matmul` and `Softmax`. For more information, please refer to:
[Supported Modules\\Supported Functions\\Custom Patched Modules](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Quantization/Inference_Using_FP8.html#supported-modules).
!!! note
Measurement files are required to run quantized models with vLLM on Gaudi accelerators. The FP8 model calibration procedure is described in the [vllm-hpu-extention](https://github.com/HabanaAI/vllm-hpu-extension/tree/main/calibration/README.md) package.
!!! note
`QUANT_CONFIG` is an environment variable that points to the measurement or quantization [JSON config file](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Quantization/Inference_Using_FP8.html#supported-json-config-file-options).
The measurement configuration file is used during the calibration procedure to collect measurements for a given model. The quantization configuration is used during inference.
## Run Online Inference Using FP8
Once you've completed the model calibration process and collected the measurements, you can run FP8 inference with vLLM using the following command:
```bash
export QUANT_CONFIG=/path/to/quant/config/inc/meta-llama-3.1-405b-instruct/maxabs_measure_g3.json
vllm serve meta-llama/Llama-3.1-405B-Instruct --quantization inc --kv-cache-dtype fp8_inc --tensor_paralel_size 8
```
!!! tip
If you are just prototyping or testing your model with FP8, you can use the `VLLM_SKIP_WARMUP=true` environment variable to disable the warmup stage, which can take a long time. However, we do not recommend disabling this feature in production environments as it causes a significant performance drop.
!!! tip
When using FP8 models, you may experience timeouts caused by the long compilation time of FP8 operations. To mitigate this problem, you can use the below environment variables:
`VLLM_ENGINE_ITERATION_TIMEOUT_S` - to adjust the vLLM server timeout. You can set the value in seconds, e.g., 600 equals 10 minutes.
`VLLM_RPC_TIMEOUT` - to adjust the RPC protocol timeout used by the OpenAI-compatible API. This value is in microseconds, e.g., 600000 equals 10 minutes.
## Run Offline Inference Using FP8
To run offline inference (after completing the model calibration process):
* Set the "QUANT_CONFIG" environment variable to point to a JSON configuration file with QUANTIZE mode.
* Pass `quantization=inc` and `kv_cache_dtype=fp8_inc` as parameters to the `LLM` object.
* Call shutdown method of the model_executor at the end of the run.
```python
from vllm import LLM
llm = LLM("llama3.1/Meta-Llama-3.1-8B-Instruct", quantization="inc", kv_cache_dtype="fp8_inc")
...
# Call llm.generate on the required prompts and sampling params.
...
llm.llm_engine.model_executor.shutdown()
```
## Device for the Model's Weights Uploading
The unquantized weights are first loaded onto the CPU, then quantized and transferred to the target device (HPU) for model execution.
This reduces the device memory footprint of model weights, as only quantized weights are stored in the device memory.
---
title: INT4 W4A16
---
[](){ #int4 }
# INT4 W4A16
vLLM supports quantizing weights to INT4 for memory savings and inference acceleration. This quantization method is particularly useful for reducing model size and maintaining low latency in workloads with low queries per second (QPS).
......@@ -53,7 +50,7 @@ When quantizing weights to INT4, you need sample data to estimate the weight upd
It's best to use calibration data that closely matches your deployment data.
For a general-purpose instruction-tuned model, you can use a dataset like `ultrachat`:
??? Code
??? code
```python
from datasets import load_dataset
......@@ -78,7 +75,7 @@ For a general-purpose instruction-tuned model, you can use a dataset like `ultra
Now, apply the quantization algorithms:
??? Code
??? code
```python
from llmcompressor.transformers import oneshot
......@@ -111,7 +108,8 @@ After quantization, you can load and run the model in vLLM:
```python
from vllm import LLM
model = LLM("./Meta-Llama-3-8B-Instruct-W4A16-G128")
llm = LLM("./Meta-Llama-3-8B-Instruct-W4A16-G128")
```
To evaluate accuracy, you can use `lm_eval`:
......@@ -141,7 +139,7 @@ lm_eval --model vllm \
The following is an example of an expanded quantization recipe you can tune to your own use case:
??? Code
??? code
```python
from compressed_tensors.quantization import (
......
---
title: INT8 W8A8
---
[](){ #int8 }
# INT8 W8A8
vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration.
This quantization method is particularly useful for reducing model size while maintaining good performance.
......@@ -54,7 +51,7 @@ When quantizing activations to INT8, you need sample data to estimate the activa
It's best to use calibration data that closely matches your deployment data.
For a general-purpose instruction-tuned model, you can use a dataset like `ultrachat`:
??? Code
??? code
```python
from datasets import load_dataset
......@@ -81,7 +78,7 @@ For a general-purpose instruction-tuned model, you can use a dataset like `ultra
Now, apply the quantization algorithms:
??? Code
??? code
```python
from llmcompressor.transformers import oneshot
......@@ -117,7 +114,8 @@ After quantization, you can load and run the model in vLLM:
```python
from vllm import LLM
model = LLM("./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token")
llm = LLM("./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token")
```
To evaluate accuracy, you can use `lm_eval`:
......
......@@ -14,7 +14,7 @@ You can quantize HuggingFace models using the example scripts provided in the Te
Below is an example showing how to quantize a model using modelopt's PTQ API:
??? Code
??? code
```python
import modelopt.torch.quantization as mtq
......@@ -50,7 +50,7 @@ with torch.inference_mode():
The quantized checkpoint can then be deployed with vLLM. As an example, the following code shows how to deploy `nvidia/Llama-3.1-8B-Instruct-FP8`, which is the FP8 quantized checkpoint derived from `meta-llama/Llama-3.1-8B-Instruct`, using vLLM:
??? Code
??? code
```python
from vllm import LLM, SamplingParams
......
---
title: Quantized KV Cache
---
[](){ #quantized-kvcache }
# Quantized KV Cache
## FP8 KV Cache
......@@ -35,7 +32,7 @@ Studies have shown that FP8 E4M3 quantization typically only minimally degrades
Here is an example of how to enable FP8 quantization:
??? Code
??? code
```python
# To calculate kv cache scales on the fly enable the calculate_kv_scales
......@@ -73,7 +70,7 @@ pip install llmcompressor
Here's a complete example using `meta-llama/Llama-3.1-8B-Instruct` (most models can use this same pattern):
??? Code
??? code
```python
from datasets import load_dataset
......
---
title: AMD Quark
---
[](){ #quark }
# AMD Quark
Quantization can effectively reduce memory and bandwidth usage, accelerate computation and improve
throughput while with minimal accuracy loss. vLLM can leverage [Quark](https://quark.docs.amd.com/latest/),
......@@ -42,7 +39,7 @@ The Quark quantization process can be listed for 5 steps as below:
Quark uses [Transformers](https://huggingface.co/docs/transformers/en/index)
to fetch model and tokenizer.
??? Code
??? code
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
......@@ -65,7 +62,7 @@ Quark uses the [PyTorch Dataloader](https://pytorch.org/tutorials/beginner/basic
to load calibration data. For more details about how to use calibration datasets efficiently, please refer
to [Adding Calibration Datasets](https://quark.docs.amd.com/latest/pytorch/calibration_datasets.html).
??? Code
??? code
```python
from datasets import load_dataset
......@@ -98,7 +95,7 @@ kv-cache and the quantization algorithm is AutoSmoothQuant.
AutoSmoothQuant config file for Llama is
`examples/torch/language_modeling/llm_ptq/models/llama/autosmoothquant_config.json`.
??? Code
??? code
```python
from quark.torch.quantization import (Config, QuantizationConfig,
......@@ -145,7 +142,7 @@ HuggingFace `safetensors`, you can refer to
[HuggingFace format exporting](https://quark.docs.amd.com/latest/pytorch/export/quark_export_hf.html)
for more exporting format details.
??? Code
??? code
```python
import torch
......@@ -176,7 +173,7 @@ for more exporting format details.
Now, you can load and run the Quark quantized model directly through the LLM entrypoint:
??? Code
??? code
```python
from vllm import LLM, SamplingParams
......@@ -232,3 +229,28 @@ python3 quantize_quark.py --model_dir meta-llama/Llama-2-70b-chat-hf \
--model_export hf_format \
--tasks gsm8k
```
## Using MXFP4 models
vLLM supports loading MXFP4 models quantized offline through AMD Quark, compliant with [Open Compute Project (OCP) specification](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf).
The scheme currently only supports dynamic quantization for activations.
Example usage, after installing the latest AMD Quark release:
```bash
vllm serve fxmarty/qwen_1.5-moe-a2.7b-mxfp4 --tensor-parallel-size 1
```
A simulation of the matrix multiplication execution in MXFP4 can be run on devices that do not support MXFP4 operations natively (e.g. AMD Instinct MI325, MI300 and MI250), dequantizing weights from MXFP4 to half precision on the fly, using a fused kernel. This is useful e.g. to evaluate MXFP4 models using vLLM, or alternatively to benefit from the ~4x memory savings (compared to float16 and bfloat16).
To generate offline models quantized using MXFP4 data type, the easiest approach is to use AMD Quark's [quantization script](https://quark.docs.amd.com/latest/pytorch/example_quark_torch_llm_ptq.html), as an example:
```bash
python quantize_quark.py --model_dir Qwen/Qwen1.5-MoE-A2.7B-Chat \
--quant_scheme w_mxfp4_a_mxfp4_sym \
--output_dir qwen_1.5-moe-a2.7b-mxfp4 \
--skip_evaluation \
--model_export hf_format \
--group_size 32
```
---
title: Supported Hardware
---
[](){ #quantization-supported-hardware }
# Supported Hardware
The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
| Implementation | Volta | Turing | Ampere | Ada | Hopper | AMD GPU | Intel GPU | x86 CPU | AWS Neuron | Google TPU |
|-----------------------|---------|----------|----------|-------|----------|-----------|-------------|-----------|------------------|--------------|
| AWQ | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ✅︎ | ❌ | ❌ |
| GPTQ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ✅︎ | ❌ | ❌ |
| Marlin (GPTQ/AWQ/FP8) | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
| INT8 (W8A8) | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ |
| FP8 (W8A8) | ❌ | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ✅︎ | ❌ |
| BitBLAS (GPTQ) | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
| AQLM | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
| bitsandbytes | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
| DeepSpeedFP | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GGUF | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ |
| Implementation | Volta | Turing | Ampere | Ada | Hopper | AMD GPU | Intel GPU | Intel Gaudi | x86 CPU | AWS Neuron | Google TPU |
|-----------------------|---------|----------|----------|-------|----------|-----------|-------------|-------------|-----------|--------------|--------------|
| AWQ | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ❌ | ✅︎ | ❌ | ❌ |
| GPTQ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ❌ | ✅︎ | ❌ | ❌ |
| Marlin (GPTQ/AWQ/FP8) | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| INT8 (W8A8) | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ |
| FP8 (W8A8) | ❌ | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ✅︎ | ❌ |
| BitBLAS (GPTQ) | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| AQLM | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| bitsandbytes | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| DeepSpeedFP | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GGUF | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
| INC (W8A8) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅︎ | ❌ | ❌ | ❌ |
- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
- ✅︎ indicates that the quantization method is supported on the specified hardware.
......
......@@ -15,7 +15,7 @@ pip install \
## Quantizing HuggingFace Models
You can quantize your own huggingface model with torchao, e.g. [transformers](https://huggingface.co/docs/transformers/main/en/quantization/torchao) and [diffusers](https://huggingface.co/docs/diffusers/en/quantization/torchao), and save the checkpoint to huggingface hub like [this](https://huggingface.co/jerryzh168/llama3-8b-int8wo) with the following example code:
??? Code
??? code
```Python
import torch
......
---
title: Reasoning Outputs
---
[](){ #reasoning-outputs }
# Reasoning Outputs
vLLM offers support for reasoning models like [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1), which are designed to generate outputs containing both reasoning steps and final conclusions.
......@@ -17,6 +14,7 @@ vLLM currently supports the following reasoning models:
| [QwQ-32B](https://huggingface.co/Qwen/QwQ-32B) | `deepseek_r1` | `guided_json`, `guided_regex` | ✅ |
| [IBM Granite 3.2 language models](https://huggingface.co/collections/ibm-granite/granite-32-language-models-67b3bc8c13508f6d064cff9a) | `granite` | ❌ | ❌ |
| [Qwen3 series](https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f) | `qwen3` | `guided_json`, `guided_regex` | ✅ |
| [Hunyuan A13B series](https://huggingface.co/collections/tencent/hunyuan-a13b-685ec38e5b46321e3ea7c4be) | `hunyuan_a13b` | `guided_json`, `guided_regex` | ✅ |
!!! note
IBM Granite 3.2 reasoning is disabled by default; to enable it, you must also pass `thinking=True` in your `chat_template_kwargs`.
......@@ -33,7 +31,7 @@ vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
Next, make a request to the model that should return the reasoning content in the response.
??? Code
??? code
```python
from openai import OpenAI
......@@ -70,7 +68,7 @@ The `reasoning_content` field contains the reasoning steps that led to the final
Streaming chat completions are also supported for reasoning models. The `reasoning_content` field is available in the `delta` field in [chat completion response chunks](https://platform.openai.com/docs/api-reference/chat/streaming).
??? Json
??? console "Json"
```json
{
......@@ -95,7 +93,7 @@ Streaming chat completions are also supported for reasoning models. The `reasoni
OpenAI Python client library does not officially support `reasoning_content` attribute for streaming output. But the client supports extra attributes in the response. You can use `hasattr` to check if the `reasoning_content` attribute is present in the response. For example:
??? Code
??? code
```python
from openai import OpenAI
......@@ -152,7 +150,7 @@ Remember to check whether the `reasoning_content` exists in the response before
The reasoning content is also available when both tool calling and the reasoning parser are enabled. Additionally, tool calling only parses functions from the `content` field, not from the `reasoning_content`.
??? Code
??? code
```python
from openai import OpenAI
......@@ -200,7 +198,7 @@ For more examples, please refer to <gh-file:examples/online_serving/openai_chat_
You can add a new `ReasoningParser` similar to <gh-file:vllm/reasoning/deepseek_r1_reasoning_parser.py>.
??? Code
??? code
```python
# import the required packages
......@@ -258,7 +256,7 @@ You can add a new `ReasoningParser` similar to <gh-file:vllm/reasoning/deepseek_
Additionally, to enable structured output, you'll need to create a new `Reasoner` similar to the one in <gh-file:vllm/reasoning/deepseek_r1_reasoning_parser.py>.
??? Code
??? code
```python
@dataclass
......
---
title: Speculative Decoding
---
[](){ #spec-decode }
# Speculative Decoding
!!! warning
Please note that speculative decoding in vLLM is not yet optimized and does
......@@ -18,7 +15,7 @@ Speculative decoding is a technique which improves inter-token latency in memory
The following code configures vLLM in an offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time.
??? Code
??? code
```python
from vllm import LLM, SamplingParams
......@@ -62,7 +59,7 @@ python -m vllm.entrypoints.openai.api_server \
Then use a client:
??? Code
??? code
```python
from openai import OpenAI
......@@ -103,7 +100,7 @@ Then use a client:
The following code configures vLLM to use speculative decoding where proposals are generated by
matching n-grams in the prompt. For more information read [this thread.](https://x.com/joao_gante/status/1747322413006643259)
??? Code
??? code
```python
from vllm import LLM, SamplingParams
......@@ -137,7 +134,7 @@ draft models that conditioning draft predictions on both context vectors and sam
For more information see [this blog](https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/) or
[this technical report](https://arxiv.org/abs/2404.19124).
??? Code
??? code
```python
from vllm import LLM, SamplingParams
......@@ -185,7 +182,7 @@ A variety of speculative models of this type are available on HF hub:
The following code configures vLLM to use speculative decoding where proposals are generated by
an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found [here](gh-file:examples/offline_inference/eagle.py).
??? Code
??? code
```python
from vllm import LLM, SamplingParams
......@@ -217,8 +214,8 @@ an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https
A few important things to consider when using the EAGLE based draft models:
1. The EAGLE draft models available in the [HF repository for EAGLE models](https://huggingface.co/yuhuili) should
be able to be loaded and used directly by vLLM after [PR 12304](https://github.com/vllm-project/vllm/pull/12304).
If you are using vllm version before [PR 12304](https://github.com/vllm-project/vllm/pull/12304), please use the
be able to be loaded and used directly by vLLM after <gh-pr:12304>.
If you are using vllm version before <gh-pr:12304>, please use the
[script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d) to convert the speculative model,
and specify `"model": "path/to/modified/eagle/model"` in `speculative_config`. If weight-loading problems still occur when using the latest version of vLLM, please leave a comment or raise an issue.
......@@ -228,7 +225,7 @@ A few important things to consider when using the EAGLE based draft models:
3. When using EAGLE-based speculators with vLLM, the observed speedup is lower than what is
reported in the reference implementation [here](https://github.com/SafeAILab/EAGLE). This issue is under
investigation and tracked here: [https://github.com/vllm-project/vllm/issues/9565](https://github.com/vllm-project/vllm/issues/9565).
investigation and tracked here: <gh-issue:9565>.
A variety of EAGLE draft models are available on the Hugging Face hub:
......@@ -259,17 +256,17 @@ speculative decoding, breaking down the guarantees into three key areas:
2. **Algorithmic Losslessness**
\- vLLM’s implementation of speculative decoding is algorithmically validated to be lossless. Key validation tests include:
> - **Rejection Sampler Convergence**: Ensures that samples from vLLM’s rejection sampler align with the target
> distribution. [View Test Code](https://github.com/vllm-project/vllm/blob/47b65a550866c7ffbd076ecb74106714838ce7da/tests/samplers/test_rejection_sampler.py#L252)
> - **Greedy Sampling Equality**: Confirms that greedy sampling with speculative decoding matches greedy sampling
> without it. This verifies that vLLM's speculative decoding framework, when integrated with the vLLM forward pass and the vLLM rejection sampler,
> provides a lossless guarantee. Almost all of the tests in <gh-dir:tests/spec_decode/e2e>.
> verify this property using [this assertion implementation](https://github.com/vllm-project/vllm/blob/b67ae00cdbbe1a58ffc8ff170f0c8d79044a684a/tests/spec_decode/e2e/conftest.py#L291)
> - **Rejection Sampler Convergence**: Ensures that samples from vLLM’s rejection sampler align with the target
> distribution. [View Test Code](https://github.com/vllm-project/vllm/blob/47b65a550866c7ffbd076ecb74106714838ce7da/tests/samplers/test_rejection_sampler.py#L252)
> - **Greedy Sampling Equality**: Confirms that greedy sampling with speculative decoding matches greedy sampling
> without it. This verifies that vLLM's speculative decoding framework, when integrated with the vLLM forward pass and the vLLM rejection sampler,
> provides a lossless guarantee. Almost all of the tests in <gh-dir:tests/spec_decode/e2e>.
> verify this property using [this assertion implementation](https://github.com/vllm-project/vllm/blob/b67ae00cdbbe1a58ffc8ff170f0c8d79044a684a/tests/spec_decode/e2e/conftest.py#L291)
3. **vLLM Logprob Stability**
\- vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the
same request across runs. For more details, see the FAQ section
titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs][faq].
titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../usage/faq.md).
While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding
can occur due to following factors:
......@@ -278,7 +275,7 @@ can occur due to following factors:
- **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially
due to non-deterministic behavior in batched operations or numerical instability.
For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs][faq].
For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../usage/faq.md).
## Resources for vLLM contributors
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment