Unverified commit a1fe24d9, authored by Harry Mellor and committed by GitHub

Migrate docs from Sphinx to MkDocs (#18145)


Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
parent d0bc2f81
(lora-adapter)= ---
title: LoRA Adapters
# LoRA Adapters ---
[](){ #lora-adapter }
This document shows you how to use [LoRA adapters](https://arxiv.org/abs/2106.09685) with vLLM on top of a base model. This document shows you how to use [LoRA adapters](https://arxiv.org/abs/2106.09685) with vLLM on top of a base model.
LoRA adapters can be used with any vLLM model that implements {class}`~vllm.model_executor.models.interfaces.SupportsLoRA`. LoRA adapters can be used with any vLLM model that implements [SupportsLoRA][vllm.model_executor.models.interfaces.SupportsLoRA].
Adapters can be efficiently served on a per-request basis with minimal overhead. First we download the adapter(s) and save Adapters can be efficiently served on a per-request basis with minimal overhead. First we download the adapter(s) and save
them locally with them locally with
...@@ -60,9 +61,8 @@ vllm serve meta-llama/Llama-2-7b-hf \ ...@@ -60,9 +61,8 @@ vllm serve meta-llama/Llama-2-7b-hf \
--lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/ --lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/
``` ```
:::{note} !!! note
The commit ID `0dfa347e8877a4d4ed19ee56c140fa518470028c` may change over time. Please check the latest commit ID in your environment to ensure you are using the correct one. The commit ID `0dfa347e8877a4d4ed19ee56c140fa518470028c` may change over time. Please check the latest commit ID in your environment to ensure you are using the correct one.
:::
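Because the snapshot directory name is a commit hash that can change, one option is to resolve it at runtime instead of hard-coding it. The following is a minimal sketch, assuming the Hugging Face cache layout seen in the command above (`models--<org>--<name>/snapshots/<commit>/`). `latest_snapshot` is a hypothetical helper, not part of vLLM or `huggingface_hub`.

```python
from pathlib import Path


def latest_snapshot(repo_cache_dir: Path) -> Path:
    """Return the most recently modified snapshot directory of a cached repo.

    Hypothetical helper: assumes the layout <repo_cache_dir>/snapshots/<commit>/.
    """
    snapshots = [p for p in (repo_cache_dir / "snapshots").iterdir() if p.is_dir()]
    if not snapshots:
        raise FileNotFoundError(f"no snapshots found under {repo_cache_dir}")
    # Pick the snapshot that was written last.
    return max(snapshots, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    repo = Path.home() / ".cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test"
    if (repo / "snapshots").is_dir():
        print(f"--lora-modules sql-lora={latest_snapshot(repo)}")
```

The printed value can then be pasted into the `vllm serve` command in place of the hard-coded path.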
The server entrypoint accepts all other LoRA configuration parameters (`max_loras`, `max_lora_rank`, `max_cpu_loras`, The server entrypoint accepts all other LoRA configuration parameters (`max_loras`, `max_lora_rank`, `max_cpu_loras`,
etc.), which will apply to all forthcoming requests. Upon querying the `/models` endpoint, we should see our LoRA along etc.), which will apply to all forthcoming requests. Upon querying the `/models` endpoint, we should see our LoRA along
......
(multimodal-inputs)= ---
title: Multimodal Inputs
---
[](){ #multimodal-inputs }
# Multimodal Inputs This page teaches you how to pass multi-modal inputs to [multi-modal models][supported-mm-models] in vLLM.
This page teaches you how to pass multi-modal inputs to [multi-modal models](#supported-mm-models) in vLLM. !!! note
We are actively iterating on multi-modal support. See [this RFC](gh-issue:4194) for upcoming changes,
:::{note} and [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) if you have any feedback or feature requests.
We are actively iterating on multi-modal support. See [this RFC](gh-issue:4194) for upcoming changes,
and [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) if you have any feedback or feature requests.
:::
## Offline Inference ## Offline Inference
To input multi-modal data, follow this schema in {class}`vllm.inputs.PromptType`: To input multi-modal data, follow this schema in [vllm.inputs.PromptType][]:
- `prompt`: The prompt should follow the format that is documented on HuggingFace. - `prompt`: The prompt should follow the format that is documented on HuggingFace.
- `multi_modal_data`: This is a dictionary that follows the schema defined in {class}`vllm.multimodal.inputs.MultiModalDataDict`. - `multi_modal_data`: This is a dictionary that follows the schema defined in [vllm.multimodal.inputs.MultiModalDataDict][].
### Image Inputs ### Image Inputs
...@@ -211,16 +211,15 @@ for o in outputs: ...@@ -211,16 +211,15 @@ for o in outputs:
Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions API](https://platform.openai.com/docs/api-reference/chat). Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions API](https://platform.openai.com/docs/api-reference/chat).
:::{important} !!! warning
A chat template is **required** to use the Chat Completions API. A chat template is **required** to use the Chat Completions API.
For HF format models, the default chat template is defined inside `chat_template.json` or `tokenizer_config.json`. For HF format models, the default chat template is defined inside `chat_template.json` or `tokenizer_config.json`.
If no default chat template is available, we will first look for a built-in fallback in <gh-file:vllm/transformers_utils/chat_templates/registry.py>. If no default chat template is available, we will first look for a built-in fallback in <gh-file:vllm/transformers_utils/chat_templates/registry.py>.
If no fallback is available, an error is raised and you have to provide the chat template manually via the `--chat-template` argument. If no fallback is available, an error is raised and you have to provide the chat template manually via the `--chat-template` argument.
For certain models, we provide alternative chat templates inside <gh-dir:vllm/examples>. For certain models, we provide alternative chat templates inside <gh-dir:vllm/examples>.
For example, VLM2Vec uses <gh-file:examples/template_vlm2vec.jinja> which is different from the default one for Phi-3-Vision. For example, VLM2Vec uses <gh-file:examples/template_vlm2vec.jinja> which is different from the default one for Phi-3-Vision.
:::
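The lookup order described in the note above can be sketched as a small function. This is only an illustration of the documented order, not vLLM's actual code: all parameter names are hypothetical, and the assumption that an explicit `--chat-template` takes precedence over the model default is ours.

```python
from typing import Mapping, Optional


def resolve_chat_template(
    cli_template: Optional[str],
    model_default: Optional[str],
    fallback_registry: Mapping[str, str],
    model_type: str,
) -> str:
    """Pick a chat template following the documented order (illustrative only)."""
    if cli_template is not None:
        # Assumption: a template passed via --chat-template wins.
        return cli_template
    if model_default is not None:
        # Default from chat_template.json or tokenizer_config.json.
        return model_default
    if model_type in fallback_registry:
        # Built-in fallback keyed by model type.
        return fallback_registry[model_type]
    raise ValueError("No chat template found; pass one via --chat-template.")
```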
### Image Inputs ### Image Inputs
...@@ -284,25 +283,21 @@ print("Chat completion output:", chat_response.choices[0].message.content) ...@@ -284,25 +283,21 @@ print("Chat completion output:", chat_response.choices[0].message.content)
Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py> Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>
:::{tip} !!! tip
Loading from local file paths is also supported on vLLM: You can specify the allowed local media path via `--allowed-local-media-path` when launching the API server/engine, Loading from local file paths is also supported on vLLM: You can specify the allowed local media path via `--allowed-local-media-path` when launching the API server/engine,
and pass the file path as `url` in the API request. and pass the file path as `url` in the API request.
:::
:::{tip}
There is no need to place image placeholders in the text content of the API request - they are already represented by the image content.
In fact, you can place image placeholders in the middle of the text by interleaving text and image content.
:::
:::{note} !!! tip
By default, the timeout for fetching images through HTTP URL is `5` seconds. There is no need to place image placeholders in the text content of the API request - they are already represented by the image content.
You can override this by setting the environment variable: In fact, you can place image placeholders in the middle of the text by interleaving text and image content.
```console !!! note
export VLLM_IMAGE_FETCH_TIMEOUT=<timeout> By default, the timeout for fetching images through HTTP URL is `5` seconds.
``` You can override this by setting the environment variable:
::: ```console
export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>
```
### Video Inputs ### Video Inputs
...@@ -357,15 +352,13 @@ print("Chat completion output from image url:", result) ...@@ -357,15 +352,13 @@ print("Chat completion output from image url:", result)
Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py> Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>
:::{note} !!! note
By default, the timeout for fetching videos through HTTP URL is `30` seconds. By default, the timeout for fetching videos through HTTP URL is `30` seconds.
You can override this by setting the environment variable: You can override this by setting the environment variable:
```console ```console
export VLLM_VIDEO_FETCH_TIMEOUT=<timeout> export VLLM_VIDEO_FETCH_TIMEOUT=<timeout>
``` ```
:::
### Audio Inputs ### Audio Inputs
...@@ -461,15 +454,13 @@ print("Chat completion output from audio url:", result) ...@@ -461,15 +454,13 @@ print("Chat completion output from audio url:", result)
Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py> Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>
:::{note} !!! note
By default, the timeout for fetching audio through HTTP URL is `10` seconds. By default, the timeout for fetching audio through HTTP URL is `10` seconds.
You can override this by setting the environment variable: You can override this by setting the environment variable:
```console
export VLLM_AUDIO_FETCH_TIMEOUT=<timeout>
```
::: ```console
export VLLM_AUDIO_FETCH_TIMEOUT=<timeout>
```
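The three fetch timeouts documented on this page (image `5`, video `30`, audio `10` seconds) follow the same override pattern. Below is a minimal sketch of how such an environment override could be read. `fetch_timeout` is a hypothetical helper written for illustration and does not reflect how vLLM parses these variables internally.

```python
import os
from typing import Mapping, Optional

# Defaults as documented on this page, in seconds.
DEFAULT_FETCH_TIMEOUTS = {"IMAGE": 5, "VIDEO": 30, "AUDIO": 10}


def fetch_timeout(modality: str, environ: Optional[Mapping[str, str]] = None) -> int:
    """Return the timeout for a modality, honouring VLLM_<MODALITY>_FETCH_TIMEOUT."""
    env = os.environ if environ is None else environ
    key = modality.upper()
    name = f"VLLM_{key}_FETCH_TIMEOUT"
    # Fall back to the documented default when the variable is unset.
    return int(env.get(name, DEFAULT_FETCH_TIMEOUTS[key]))
```

For example, `fetch_timeout("audio", {"VLLM_AUDIO_FETCH_TIMEOUT": "20"})` yields the overridden value, while an empty environment yields the default.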
### Embedding Inputs ### Embedding Inputs
...@@ -535,7 +526,6 @@ chat_completion = client.chat.completions.create( ...@@ -535,7 +526,6 @@ chat_completion = client.chat.completions.create(
) )
``` ```
:::{note} !!! note
Only one message can contain `{"type": "image_embeds"}`. Only one message can contain `{"type": "image_embeds"}`.
If used with a model that requires additional parameters, you must also provide a tensor for each of them, e.g. `image_grid_thw`, `image_sizes`, etc. If used with a model that requires additional parameters, you must also provide a tensor for each of them, e.g. `image_grid_thw`, `image_sizes`, etc.
:::
...@@ -6,13 +6,12 @@ This page teaches you how to pass prompt embedding inputs to vLLM. ...@@ -6,13 +6,12 @@ This page teaches you how to pass prompt embedding inputs to vLLM.
The traditional flow of text data for a Large Language Model goes from text to token ids (via a tokenizer) then from token ids to prompt embeddings. For a traditional decoder-only model (such as meta-llama/Llama-3.1-8B-Instruct), this step of converting token ids to prompt embeddings happens via a look-up from a learned embedding matrix, but the model is not limited to processing only the embeddings corresponding to its token vocabulary. The traditional flow of text data for a Large Language Model goes from text to token ids (via a tokenizer) then from token ids to prompt embeddings. For a traditional decoder-only model (such as meta-llama/Llama-3.1-8B-Instruct), this step of converting token ids to prompt embeddings happens via a look-up from a learned embedding matrix, but the model is not limited to processing only the embeddings corresponding to its token vocabulary.
:::{note} !!! note
Prompt embeddings are currently only supported in the v0 engine. Prompt embeddings are currently only supported in the v0 engine.
:::
## Offline Inference ## Offline Inference
To input prompt embeddings, follow this schema in {class}`vllm.inputs.EmbedsPrompt`: To input prompt embeddings, follow this schema in [vllm.inputs.EmbedsPrompt][]:
- `prompt_embeds`: A torch tensor representing a sequence of prompt/token embeddings. This has the shape (sequence_length, hidden_size), where sequence_length is the number of token embeddings and hidden_size is the hidden size (embedding size) of the model. - `prompt_embeds`: A torch tensor representing a sequence of prompt/token embeddings. This has the shape (sequence_length, hidden_size), where sequence_length is the number of token embeddings and hidden_size is the hidden size (embedding size) of the model.
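To make the `(sequence_length, hidden_size)` shape concrete, here is a toy sketch of the embedding look-up described above, written with plain Python lists instead of torch tensors. The matrix values, vocabulary size, and the name `embed_tokens` are made up for illustration.

```python
from typing import List


def embed_tokens(token_ids: List[int], embedding_matrix: List[List[float]]) -> List[List[float]]:
    """Look up one embedding row per token id."""
    return [embedding_matrix[t] for t in token_ids]


# Toy vocabulary of 4 tokens with hidden_size = 3 (values are arbitrary).
embedding_matrix = [
    [0.0, 0.1, 0.2],
    [1.0, 1.1, 1.2],
    [2.0, 2.1, 2.2],
    [3.0, 3.1, 3.2],
]
token_ids = [2, 0, 3, 3, 1]

prompt_embeds = embed_tokens(token_ids, embedding_matrix)
sequence_length, hidden_size = len(prompt_embeds), len(prompt_embeds[0])
print((sequence_length, hidden_size))  # (5, 3)
```

With prompt embeddings as input, the rows do not have to come from this matrix at all, as long as each row has `hidden_size` entries.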
......
---
title: Quantization
---
[](){ #quantization-index }
Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices.
Contents:
- [Supported_Hardware](supported_hardware.md)
- [Auto_Awq](auto_awq.md)
- [Bnb](bnb.md)
- [Bitblas](bitblas.md)
- [Gguf](gguf.md)
- [Gptqmodel](gptqmodel.md)
- [Int4](int4.md)
- [Int8](int8.md)
- [Fp8](fp8.md)
- [Modelopt](modelopt.md)
- [Quark](quark.md)
- [Quantized_Kvcache](quantized_kvcache.md)
- [Torchao](torchao.md)
(auto-awq)= ---
title: AutoAWQ
# AutoAWQ ---
[](){ #auto-awq }
To create a new 4-bit quantized model, you can leverage [AutoAWQ](https://github.com/casper-hansen/AutoAWQ). To create a new 4-bit quantized model, you can leverage [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
Quantization reduces the model's precision from BF16/FP16 to INT4 which effectively reduces the total model memory footprint. Quantization reduces the model's precision from BF16/FP16 to INT4 which effectively reduces the total model memory footprint.
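As a rough illustration of the footprint reduction, the weight memory scales with the number of bits per parameter. The sketch below assumes a 7B-parameter model and counts weights only. It ignores quantization scales, zero points, activations, and the KV cache, so real numbers will differ.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in decimal gigabytes (weights only)."""
    return num_params * bits_per_weight / 8 / 1e9


num_params = 7e9  # assumed model size for illustration
fp16_gb = weight_memory_gb(num_params, 16)
int4_gb = weight_memory_gb(num_params, 4)
print(f"FP16: {fp16_gb:.1f} GB, INT4: {int4_gb:.1f} GB")  # FP16: 14.0 GB, INT4: 3.5 GB
```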
......
(bitblas)= ---
title: BitBLAS
# BitBLAS ---
[](){ #bitblas }
vLLM now supports [BitBLAS](https://github.com/microsoft/BitBLAS) for more efficient and flexible model inference. Compared to other quantization frameworks, BitBLAS provides more precision combinations. vLLM now supports [BitBLAS](https://github.com/microsoft/BitBLAS) for more efficient and flexible model inference. Compared to other quantization frameworks, BitBLAS provides more precision combinations.
:::{note} !!! note
Ensure your hardware supports the selected `dtype` (`torch.bfloat16` or `torch.float16`). Ensure your hardware supports the selected `dtype` (`torch.bfloat16` or `torch.float16`).
Most recent NVIDIA GPUs support `float16`, while `bfloat16` is more common on newer architectures like Ampere or Hopper. Most recent NVIDIA GPUs support `float16`, while `bfloat16` is more common on newer architectures like Ampere or Hopper.
For details see [supported hardware](https://docs.vllm.ai/en/latest/features/quantization/supported_hardware.html). For details see [supported hardware](https://docs.vllm.ai/en/latest/features/quantization/supported_hardware.html).
:::
Below are the steps to utilize BitBLAS with vLLM. Below are the steps to utilize BitBLAS with vLLM.
......
(bits-and-bytes)= ---
title: BitsAndBytes
# BitsAndBytes ---
[](){ #bits-and-bytes }
vLLM now supports [BitsAndBytes](https://github.com/TimDettmers/bitsandbytes) for more efficient model inference. vLLM now supports [BitsAndBytes](https://github.com/TimDettmers/bitsandbytes) for more efficient model inference.
BitsAndBytes quantizes models to reduce memory usage and enhance performance without significantly sacrificing accuracy. BitsAndBytes quantizes models to reduce memory usage and enhance performance without significantly sacrificing accuracy.
......
(fp8)= ---
title: FP8 W8A8
# FP8 W8A8 ---
[](){ #fp8 }
vLLM supports FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on GPUs such as Nvidia H100 and AMD MI300x. vLLM supports FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on GPUs such as Nvidia H100 and AMD MI300x.
Currently, only Hopper and Ada Lovelace GPUs are officially supported for W8A8. Currently, only Hopper and Ada Lovelace GPUs are officially supported for W8A8.
...@@ -14,10 +15,9 @@ The FP8 types typically supported in hardware have two distinct representations, ...@@ -14,10 +15,9 @@ The FP8 types typically supported in hardware have two distinct representations,
- **E4M3**: Consists of 1 sign bit, 4 exponent bits, and 3 bits of mantissa. It can store values up to +/-448 and `nan`. - **E4M3**: Consists of 1 sign bit, 4 exponent bits, and 3 bits of mantissa. It can store values up to +/-448 and `nan`.
- **E5M2**: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- `inf`, and `nan`. The tradeoff for the increased dynamic range is lower precision of the stored values. - **E5M2**: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- `inf`, and `nan`. The tradeoff for the increased dynamic range is lower precision of the stored values.
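The maximum values quoted above follow from the bit layouts. The sketch below derives them, assuming E5M2 uses an IEEE-style encoding (the top exponent is reserved for `inf`/`nan`) and E4M3 reserves only the all-ones mantissa in the top exponent for `nan`. `max_finite_value` is an illustrative helper, not a vLLM function.

```python
def max_finite_value(exp_bits: int, man_bits: int, ieee_like: bool) -> float:
    """Largest finite magnitude of a small floating-point format."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # Top exponent is reserved for inf/nan, so use the one below it.
        max_exp_field = 2 ** exp_bits - 2
        max_man_field = 2 ** man_bits - 1
    else:
        # Top exponent is usable; only the all-ones mantissa encodes nan.
        max_exp_field = 2 ** exp_bits - 1
        max_man_field = 2 ** man_bits - 2
    return (1 + max_man_field / 2 ** man_bits) * 2.0 ** (max_exp_field - bias)


print(max_finite_value(4, 3, ieee_like=False))  # 448.0
print(max_finite_value(5, 2, ieee_like=True))   # 57344.0
```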
:::{note} !!! note
FP8 computation is supported on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper). FP8 computation is supported on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper).
FP8 models will run on compute capability >= 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin. FP8 models will run on compute capability >= 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin.
:::
## Installation ## Installation
...@@ -94,9 +94,8 @@ print(result[0].outputs[0].text) ...@@ -94,9 +94,8 @@ print(result[0].outputs[0].text)
Evaluate accuracy with `lm_eval` (for example on 250 samples of `gsm8k`): Evaluate accuracy with `lm_eval` (for example on 250 samples of `gsm8k`):
:::{note} !!! note
Quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations. Quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations.
:::
```console ```console
$ MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic $ MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic
...@@ -133,6 +132,5 @@ result = model.generate("Hello, my name is") ...@@ -133,6 +132,5 @@ result = model.generate("Hello, my name is")
print(result[0].outputs[0].text) print(result[0].outputs[0].text)
``` ```
:::{warning} !!! warning
Currently, we load the model at original precision before quantizing down to 8-bits, so you need enough memory to load the whole model. Currently, we load the model at original precision before quantizing down to 8-bits, so you need enough memory to load the whole model.
:::
(gguf)= ---
title: GGUF
---
[](){ #gguf }
# GGUF !!! warning
Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment, and it might be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team.
:::{warning} !!! warning
Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment, and it might be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team. Currently, vLLM only supports loading single-file GGUF models. If you have a multi-file GGUF model, you can use the [gguf-split](https://github.com/ggerganov/llama.cpp/pull/6135) tool to merge it into a single-file model.
:::
:::{warning}
Currently, vLLM only supports loading single-file GGUF models. If you have a multi-file GGUF model, you can use the [gguf-split](https://github.com/ggerganov/llama.cpp/pull/6135) tool to merge it into a single-file model.
:::
To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command: To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command:
...@@ -25,9 +24,8 @@ You can also add `--tensor-parallel-size 2` to enable tensor parallelism inferen ...@@ -25,9 +24,8 @@ You can also add `--tensor-parallel-size 2` to enable tensor parallelism inferen
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2 vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2
``` ```
:::{warning} !!! warning
We recommend using the tokenizer from the base model instead of the GGUF model, because the tokenizer conversion from GGUF is time-consuming and unstable, especially for models with a large vocabulary. We recommend using the tokenizer from the base model instead of the GGUF model, because the tokenizer conversion from GGUF is time-consuming and unstable, especially for models with a large vocabulary.
:::
GGUF assumes that Hugging Face can convert the metadata to a config file. If Hugging Face doesn't support your model, you can manually create a config and pass it via `--hf-config-path`. GGUF assumes that Hugging Face can convert the metadata to a config file. If Hugging Face doesn't support your model, you can manually create a config and pass it via `--hf-config-path`.
......
(gptqmodel)= ---
title: GPTQModel
# GPTQModel ---
[](){ #gptqmodel }
To create a new 4-bit or 8-bit GPTQ quantized model, you can leverage [GPTQModel](https://github.com/ModelCloud/GPTQModel) from ModelCloud.AI. To create a new 4-bit or 8-bit GPTQ quantized model, you can leverage [GPTQModel](https://github.com/ModelCloud/GPTQModel) from ModelCloud.AI.
......
(int4)= ---
title: INT4 W4A16
# INT4 W4A16 ---
[](){ #int4 }
vLLM supports quantizing weights to INT4 for memory savings and inference acceleration. This quantization method is particularly useful for reducing model size and maintaining low latency in workloads with low queries per second (QPS). vLLM supports quantizing weights to INT4 for memory savings and inference acceleration. This quantization method is particularly useful for reducing model size and maintaining low latency in workloads with low queries per second (QPS).
Please visit the HF collection of [quantized INT4 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/int4-llms-for-vllm-668ec34bf3c9fa45f857df2c). Please visit the HF collection of [quantized INT4 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/int4-llms-for-vllm-668ec34bf3c9fa45f857df2c).
:::{note} !!! note
INT4 computation is supported on NVIDIA GPUs with compute capability >= 8.0 (Ampere, Ada Lovelace, Hopper, Blackwell). INT4 computation is supported on NVIDIA GPUs with compute capability >= 8.0 (Ampere, Ada Lovelace, Hopper, Blackwell).
:::
## Prerequisites ## Prerequisites
...@@ -121,9 +121,8 @@ $ lm_eval --model vllm \ ...@@ -121,9 +121,8 @@ $ lm_eval --model vllm \
--batch_size 'auto' --batch_size 'auto'
``` ```
:::{note} !!! note
Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations. Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations.
:::
## Best Practices ## Best Practices
......
(int8)= ---
title: INT8 W8A8
# INT8 W8A8 ---
[](){ #int8 }
vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration. vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration.
This quantization method is particularly useful for reducing model size while maintaining good performance. This quantization method is particularly useful for reducing model size while maintaining good performance.
Please visit the HF collection of [quantized INT8 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/int8-llms-for-vllm-668ec32c049dca0369816415). Please visit the HF collection of [quantized INT8 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/int8-llms-for-vllm-668ec32c049dca0369816415).
:::{note} !!! note
INT8 computation is supported on NVIDIA GPUs with compute capability >= 7.5 (Turing, Ampere, Ada Lovelace, Hopper, Blackwell). INT8 computation is supported on NVIDIA GPUs with compute capability >= 7.5 (Turing, Ampere, Ada Lovelace, Hopper, Blackwell).
:::
## Prerequisites ## Prerequisites
...@@ -125,9 +125,8 @@ $ lm_eval --model vllm \ ...@@ -125,9 +125,8 @@ $ lm_eval --model vllm \
--batch_size 'auto' --batch_size 'auto'
``` ```
:::{note} !!! note
Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations. Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations.
:::
## Best Practices ## Best Practices
......
(quantized-kvcache)= ---
title: Quantized KV Cache
# Quantized KV Cache ---
[](){ #quantized-kvcache }
## FP8 KV Cache ## FP8 KV Cache
......
(quark)= ---
title: AMD QUARK
# AMD QUARK ---
[](){ #quark }
Quantization can effectively reduce memory and bandwidth usage, accelerate computation and improve Quantization can effectively reduce memory and bandwidth usage, accelerate computation and improve
throughput with minimal accuracy loss. vLLM can leverage [Quark](https://quark.docs.amd.com/latest/), throughput with minimal accuracy loss. vLLM can leverage [Quark](https://quark.docs.amd.com/latest/),
...@@ -86,13 +87,12 @@ We need to set the quantization configuration, you can check ...@@ -86,13 +87,12 @@ We need to set the quantization configuration, you can check
for further details. Here we use FP8 per-tensor quantization on weight, activation, for further details. Here we use FP8 per-tensor quantization on weight, activation,
kv-cache and the quantization algorithm is AutoSmoothQuant. kv-cache and the quantization algorithm is AutoSmoothQuant.
:::{note} !!! note
Note the quantization algorithm needs a JSON config file and the config file is located in Note the quantization algorithm needs a JSON config file and the config file is located in
[Quark Pytorch examples](https://quark.docs.amd.com/latest/pytorch/pytorch_examples.html), [Quark Pytorch examples](https://quark.docs.amd.com/latest/pytorch/pytorch_examples.html),
under the directory `examples/torch/language_modeling/llm_ptq/models`. For example, under the directory `examples/torch/language_modeling/llm_ptq/models`. For example,
AutoSmoothQuant config file for Llama is AutoSmoothQuant config file for Llama is
`examples/torch/language_modeling/llm_ptq/models/llama/autosmoothquant_config.json`. `examples/torch/language_modeling/llm_ptq/models/llama/autosmoothquant_config.json`.
:::
```python ```python
from quark.torch.quantization import (Config, QuantizationConfig, from quark.torch.quantization import (Config, QuantizationConfig,
......
---
title: Supported Hardware
---
[](){ #quantization-supported-hardware }
The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
| Implementation | Volta | Turing | Ampere | Ada | Hopper | AMD GPU | Intel GPU | x86 CPU | AWS Inferentia | Google TPU |
|-----------------------|---------|----------|----------|-------|----------|-----------|-------------|-----------|------------------|--------------|
| AWQ | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ✅︎ | ❌ | ❌ |
| GPTQ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ✅︎ | ❌ | ❌ |
| Marlin (GPTQ/AWQ/FP8) | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
| INT8 (W8A8) | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ✅︎ | ❌ | ✅︎ |
| FP8 (W8A8) | ❌ | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ |
| BitBLAS (GPTQ) | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
| AQLM | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
| bitsandbytes | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
| DeepSpeedFP | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
| GGUF | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ |
- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
- ✅︎ indicates that the quantization method is supported on the specified hardware.
- ❌ indicates that the quantization method is not supported on the specified hardware.
!!! note
This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
For the most up-to-date information on hardware support and quantization methods, please refer to <gh-dir:vllm/model_executor/layers/quantization> or consult with the vLLM development team.
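The NVIDIA columns of the table can be turned into a small lookup keyed by SM version. The sketch below encodes only a few rows, copied from the table above, and the names `ARCH_BY_SM`, `NVIDIA_SUPPORT`, and `is_supported` are hypothetical. It is a reading aid, not an API exposed by vLLM.

```python
from typing import Tuple

# SM version -> architecture, as listed in the notes under the table.
ARCH_BY_SM = {
    (7, 0): "Volta",
    (7, 5): "Turing",
    (8, 0): "Ampere",
    (8, 6): "Ampere",
    (8, 9): "Ada",
    (9, 0): "Hopper",
}

# A subset of the table rows (NVIDIA columns only).
NVIDIA_SUPPORT = {
    "AWQ": {"Turing", "Ampere", "Ada", "Hopper"},
    "GPTQ": {"Volta", "Turing", "Ampere", "Ada", "Hopper"},
    "Marlin (GPTQ/AWQ/FP8)": {"Ampere", "Ada", "Hopper"},
    "INT8 (W8A8)": {"Turing", "Ampere", "Ada", "Hopper"},
    "FP8 (W8A8)": {"Ada", "Hopper"},
}


def is_supported(method: str, sm: Tuple[int, int]) -> bool:
    """Check a quantization method against an SM version using the table."""
    arch = ARCH_BY_SM.get(sm)
    return arch is not None and arch in NVIDIA_SUPPORT[method]


print(is_supported("FP8 (W8A8)", (8, 9)))  # True
print(is_supported("FP8 (W8A8)", (8, 6)))  # False
```

On a machine with PyTorch and a CUDA device, the `sm` tuple could be obtained from `torch.cuda.get_device_capability()`.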
(reasoning-outputs)= ---
title: Reasoning Outputs
# Reasoning Outputs ---
[](){ #reasoning-outputs }
vLLM offers support for reasoning models like [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1), which are designed to generate outputs containing both reasoning steps and final conclusions. vLLM offers support for reasoning models like [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1), which are designed to generate outputs containing both reasoning steps and final conclusions.
...@@ -17,10 +18,9 @@ vLLM currently supports the following reasoning models: ...@@ -17,10 +18,9 @@ vLLM currently supports the following reasoning models:
| [IBM Granite 3.2 language models](https://huggingface.co/collections/ibm-granite/granite-32-language-models-67b3bc8c13508f6d064cff9a) | `granite` | ❌ | ❌ | | [IBM Granite 3.2 language models](https://huggingface.co/collections/ibm-granite/granite-32-language-models-67b3bc8c13508f6d064cff9a) | `granite` | ❌ | ❌ |
| [Qwen3 series](https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f) | `qwen3` | `guided_json`, `guided_regex` | ✅ | | [Qwen3 series](https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f) | `qwen3` | `guided_json`, `guided_regex` | ✅ |
:::{note} !!! note
IBM Granite 3.2 reasoning is disabled by default; to enable it, you must also pass `thinking=True` in your `chat_template_kwargs`. IBM Granite 3.2 reasoning is disabled by default; to enable it, you must also pass `thinking=True` in your `chat_template_kwargs`.
The reasoning feature for the Qwen3 series is enabled by default. To disable it, you must pass `enable_thinking=False` in your `chat_template_kwargs`. The reasoning feature for the Qwen3 series is enabled by default. To disable it, you must pass `enable_thinking=False` in your `chat_template_kwargs`.
:::
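The two switches in the note use different keys, which is easy to mix up. Here is a minimal sketch that builds the `chat_template_kwargs` payload per reasoning parser. `thinking_extra_body` is a hypothetical helper, and passing the result through the OpenAI client's `extra_body` argument is an assumption about how the request is sent.

```python
def thinking_extra_body(parser: str, enable: bool) -> dict:
    """Build the request-body fragment that toggles reasoning for a parser."""
    # Key names taken from the note above: Granite uses `thinking`,
    # Qwen3 uses `enable_thinking`.
    key_by_parser = {"granite": "thinking", "qwen3": "enable_thinking"}
    return {"chat_template_kwargs": {key_by_parser[parser]: enable}}


print(thinking_extra_body("granite", True))
print(thinking_extra_body("qwen3", False))
```

The returned dictionary would then be supplied alongside the usual `model` and `messages` arguments, for example as `extra_body=thinking_extra_body("qwen3", False)`.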
## Quickstart ## Quickstart
...@@ -167,12 +167,10 @@ client = OpenAI( ...@@ -167,12 +167,10 @@ client = OpenAI(
models = client.models.list() models = client.models.list()
model = models.data[0].id model = models.data[0].id
class People(BaseModel): class People(BaseModel):
name: str name: str
age: int age: int
json_schema = People.model_json_schema() json_schema = People.model_json_schema()
prompt = ("Generate a JSON with the name and age of one random person.") prompt = ("Generate a JSON with the name and age of one random person.")
......
(spec-decode)= ---
title: Speculative Decoding
---
[](){ #spec-decode }
# Speculative Decoding !!! warning
Please note that speculative decoding in vLLM is not yet optimized and does
not usually yield inter-token latency reductions for all prompt datasets or sampling parameters.
The work to optimize it is ongoing and can be followed here: <gh-issue:4630>
:::{warning} !!! warning
Please note that speculative decoding in vLLM is not yet optimized and does Currently, speculative decoding in vLLM is not compatible with pipeline parallelism.
not usually yield inter-token latency reductions for all prompt datasets or sampling parameters.
The work to optimize it is ongoing and can be followed here: <gh-issue:4630>
:::
:::{warning}
Currently, speculative decoding in vLLM is not compatible with pipeline parallelism.
:::
This document shows how to use [Speculative Decoding](https://x.com/karpathy/status/1697318534555336961) with vLLM. This document shows how to use [Speculative Decoding](https://x.com/karpathy/status/1697318534555336961) with vLLM.
Speculative decoding is a technique which improves inter-token latency in memory-bound LLM inference. Speculative decoding is a technique which improves inter-token latency in memory-bound LLM inference.
...@@ -51,9 +50,8 @@ python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model ...@@ -51,9 +50,8 @@ python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model
--speculative_config '{"model": "facebook/opt-125m", "num_speculative_tokens": 5}' --speculative_config '{"model": "facebook/opt-125m", "num_speculative_tokens": 5}'
``` ```
:::{warning} !!! warning
Note: Please use `--speculative_config` to set all configurations related to speculative decoding. The previous method of specifying the model through `--speculative_model` and adding related parameters (e.g., `--num_speculative_tokens`) separately has been deprecated now. Note: Please use `--speculative_config` to set all configurations related to speculative decoding. The previous method of specifying the model through `--speculative_model` and adding related parameters (e.g., `--num_speculative_tokens`) separately has been deprecated now.
:::
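Since `--speculative_config` takes a single JSON string, it can be less error-prone to build that string in code than to hand-quote it. The following is a minimal sketch using only the Python standard library; the helper name `build_speculative_flag` is hypothetical and not part of vLLM.

```python
import json
import shlex


def build_speculative_flag(draft_model: str, num_speculative_tokens: int) -> str:
    """Return the --speculative_config flag with a shell-quoted JSON value.

    Hypothetical helper for illustration; it only formats a string.
    """
    config = {
        "model": draft_model,
        "num_speculative_tokens": num_speculative_tokens,
    }
    # json.dumps produces the JSON text; shlex.quote wraps it for the shell.
    return "--speculative_config " + shlex.quote(json.dumps(config))


flag = build_speculative_flag("facebook/opt-125m", 5)
print(flag)
```

The printed flag can be appended to the server command; the model name and token count are the ones used in the example command.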
Then use a client: Then use a client:
...@@ -255,7 +253,7 @@ speculative decoding, breaking down the guarantees into three key areas: ...@@ -255,7 +253,7 @@ speculative decoding, breaking down the guarantees into three key areas:
3. **vLLM Logprob Stability** 3. **vLLM Logprob Stability**
\- vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the \- vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the
same request across runs. For more details, see the FAQ section same request across runs. For more details, see the FAQ section
titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](#faq). titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs][faq].
While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding
can occur due to the following factors: can occur due to the following factors:
...@@ -264,7 +262,7 @@ can occur due to following factors: ...@@ -264,7 +262,7 @@ can occur due to following factors:
- **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially - **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially
due to non-deterministic behavior in batched operations or numerical instability. due to non-deterministic behavior in batched operations or numerical instability.
For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](#faq). For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs][faq].
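One plausible source of the numerical variation described above is that floating-point addition is not associative, so the order in which values are combined can change the last bits of a result. The toy example below illustrates only that general property with plain Python floats; it does not reproduce vLLM's batched kernels.

```python
import math

values = [0.1, 0.2, 0.3]

# Same three numbers, two different groupings of the additions.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

# The results differ in their last bits even though the inputs are identical.
print(left_to_right == right_to_left)

# They are still equal within a small tolerance.
print(math.isclose(left_to_right, right_to_left))
```

When comparing logprobs across runs or batch sizes, a tolerance-based comparison such as `math.isclose` is therefore usually more meaningful than exact equality.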
## Resources for vLLM contributors ## Resources for vLLM contributors
......
(structured-outputs)= ---
title: Structured Outputs
# Structured Outputs ---
[](){ #structured-outputs }
vLLM supports the generation of structured outputs using vLLM supports the generation of structured outputs using
[xgrammar](https://github.com/mlc-ai/xgrammar) or [xgrammar](https://github.com/mlc-ai/xgrammar) or
...@@ -20,7 +21,7 @@ The following parameters are supported, which must be added as extra parameters: ...@@ -20,7 +21,7 @@ The following parameters are supported, which must be added as extra parameters:
- `guided_grammar`: the output will follow the context free grammar. - `guided_grammar`: the output will follow the context free grammar.
- `structural_tag`: Follow a JSON schema within a set of specified tags within the generated text. - `structural_tag`: Follow a JSON schema within a set of specified tags within the generated text.
You can see the complete list of supported parameters on the [OpenAI-Compatible Server](#openai-compatible-server) page. You can see the complete list of supported parameters on the [OpenAI-Compatible Server][openai-compatible-server] page.
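As a rough sketch of what "extra parameters" means in practice, the snippet below builds the dictionary that would accompany a request. It assumes the `guided_json` parameter from vLLM's documented list (not shown in this excerpt) and uses a hand-written schema for illustration; no request is sent.

```python
import json

# Hand-written JSON schema for an object with a string `name` and integer `age`.
person_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

# Assumption: `guided_json` is the extra parameter that carries a JSON schema.
extra_body = {"guided_json": person_schema}

# With the OpenAI Python client, this dict would be passed as `extra_body=...`
# to `client.chat.completions.create(...)`. Here it is only printed.
print(json.dumps(extra_body, indent=2))
```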
Structured outputs are supported by default in the OpenAI-Compatible Server. You Structured outputs are supported by default in the OpenAI-Compatible Server. You
may choose to specify the backend to use by setting the may choose to specify the backend to use by setting the
...@@ -83,13 +84,11 @@ class CarType(str, Enum): ...@@ -83,13 +84,11 @@ class CarType(str, Enum):
truck = "Truck" truck = "Truck"
coupe = "Coupe" coupe = "Coupe"
class CarDescription(BaseModel): class CarDescription(BaseModel):
brand: str brand: str
model: str model: str
car_type: CarType car_type: CarType
json_schema = CarDescription.model_json_schema() json_schema = CarDescription.model_json_schema()
completion = client.chat.completions.create( completion = client.chat.completions.create(
...@@ -105,11 +104,10 @@ completion = client.chat.completions.create( ...@@ -105,11 +104,10 @@ completion = client.chat.completions.create(
print(completion.choices[0].message.content) print(completion.choices[0].message.content)
``` ```
:::{tip} !!! tip
While not strictly necessary, normally it's better to indicate in the prompt the While not strictly necessary, normally it's better to indicate in the prompt the
JSON schema and how the fields should be populated. This can improve the JSON schema and how the fields should be populated. This can improve the
results notably in most cases. results notably in most cases.
:::
Finally we have the `guided_grammar` option, which is probably the most Finally we have the `guided_grammar` option, which is probably the most
difficult to use, but it's really powerful. It allows us to define complete difficult to use, but it's really powerful. It allows us to define complete
...@@ -160,12 +158,10 @@ Here is a simple example demonstrating how to get structured output using Pydant ...@@ -160,12 +158,10 @@ Here is a simple example demonstrating how to get structured output using Pydant
from pydantic import BaseModel from pydantic import BaseModel
from openai import OpenAI from openai import OpenAI
class Info(BaseModel): class Info(BaseModel):
name: str name: str
age: int age: int
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy") client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")
completion = client.beta.chat.completions.parse( completion = client.beta.chat.completions.parse(
model="meta-llama/Llama-3.1-8B-Instruct", model="meta-llama/Llama-3.1-8B-Instruct",
...@@ -199,17 +195,14 @@ from typing import List ...@@ -199,17 +195,14 @@ from typing import List
from pydantic import BaseModel from pydantic import BaseModel
from openai import OpenAI from openai import OpenAI
class Step(BaseModel): class Step(BaseModel):
explanation: str explanation: str
output: str output: str
class MathResponse(BaseModel): class MathResponse(BaseModel):
steps: list[Step] steps: list[Step]
final_answer: str final_answer: str
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy") client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")
completion = client.beta.chat.completions.parse( completion = client.beta.chat.completions.parse(
model="meta-llama/Llama-3.1-8B-Instruct", model="meta-llama/Llama-3.1-8B-Instruct",
......