@@ -9,39 +9,41 @@ The main benefits are lower latency and memory usage.
You can quantize your own models by installing AutoAWQ or picking one of the [6500+ models on Huggingface](https://huggingface.co/models?search=awq).
```bash
pip install autoawq
```
After installing AutoAWQ, you are ready to quantize a model. Please refer to the [AutoAWQ documentation](https://casper-hansen.github.io/AutoAWQ/examples/#basic-quantization) for further details. Here is an example of how to quantize `mistralai/Mistral-7B-Instruct-v0.2`:
# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print(f'Model is quantized and saved at "{quant_path}"')
```
To run an AWQ model with vLLM, you can use [TheBloke/Llama-2-7b-Chat-AWQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ) with the following command:
@@ -23,7 +23,7 @@ The FP8 types typically supported in hardware have two distinct representations,
To produce performant FP8 quantized models with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
```bash
pip install llmcompressor
```
...
@@ -58,28 +58,30 @@ For FP8 quantization, we can recover accuracy with simple RTN quantization. We r
Since simple RTN does not require data for weight quantization and the activations are quantized dynamically, we do not need any calibration data for this quantization flow.
Install `vllm` and `lm-evaluation-harness` for evaluation:
```bash
pip install vllm lm-eval==0.4.4
```
...
@@ -97,9 +99,9 @@ Evaluate accuracy with `lm_eval` (for example on 250 samples of `gsm8k`):
!!! note
    Quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations.
To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command:
GGUF assumes that huggingface can convert the metadata to a config file. If huggingface doesn't support your model, you can manually create a config and pass it as `hf-config-path`.
```bash
# If your model is not supported by huggingface, you can manually provide a huggingface-compatible config path
@@ -21,7 +21,7 @@ for more details on this and other advanced features.
You can quantize your own models by installing [GPTQModel](https://github.com/ModelCloud/GPTQModel) or picking one of the [5000+ models on Huggingface](https://huggingface.co/models?search=gptq).
```bash
pip install -U gptqmodel --no-build-isolation -v
```
...
@@ -31,34 +31,36 @@ After installing GPTQModel, you are ready to quantize a model. Please refer to t
Here is an example of how to quantize `meta-llama/Llama-3.2-1B-Instruct`:
model = GPTQModel.load(model_id, quant_config)
# increase `batch_size` to match gpu/vram specs to speed up quantization
model.quantize(calibration_dataset, batch_size=2)
model.save(quant_path)
```
## Running a quantized model with vLLM
To run a GPTQModel quantized model with vLLM, you can use [DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2](https://huggingface.co/ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2) with the following command:
# Load the model from HuggingFace
model = AutoModelForCausalLM.from_pretrained("<path_or_model_id>")
# Select the quantization config, for example, FP8
config = mtq.FP8_DEFAULT_CFG
# Define a forward loop function for calibration
def forward_loop(model):
    for data in calib_set:
        model(data)
# PTQ with in-place replacement of quantized modules
model = mtq.quantize(model, config, forward_loop)
```
After the model is quantized, you can export it to a quantized checkpoint using the export API:
...
@@ -48,31 +50,33 @@ with torch.inference_mode():
The quantized checkpoint can then be deployed with vLLM. As an example, the following code shows how to deploy `nvidia/Llama-3.1-8B-Instruct-FP8`, which is the FP8 quantized checkpoint derived from `meta-llama/Llama-3.1-8B-Instruct`, using vLLM:
??? Code

    ```python
    from vllm import LLM, SamplingParams

    def main():
        model_id = "nvidia/Llama-3.1-8B-Instruct-FP8"
        # Ensure you specify quantization='modelopt' when loading the modelopt checkpoint
The above script will create a folder in your current directory containing your quantized model (e.g., `Llama-3.1-8B-Instruct-FP8-KV`) with calibrated scales.
@@ -4,7 +4,7 @@ TorchAO is an architecture optimization library for PyTorch, it provides high pe
We recommend installing the latest torchao nightly with
```bash
# Install the latest TorchAO nightly build
# Choose the CUDA version that matches your system (cu126, cu128, etc.)
pip install \
...
@@ -15,26 +15,28 @@ pip install \
## Quantizing HuggingFace Models
You can quantize your own huggingface model with torchao, e.g. [transformers](https://huggingface.co/docs/transformers/main/en/quantization/torchao) and [diffusers](https://huggingface.co/docs/diffusers/en/quantization/torchao), and save the checkpoint to huggingface hub like [this](https://huggingface.co/jerryzh168/llama3-8b-int8wo) with the following example code:
??? Code

    ```Python
    import torch
    from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
    from torchao.quantization import Int8WeightOnlyConfig

    model_name = "meta-llama/Meta-Llama-3-8B"
Alternatively, you can use the [TorchAO Quantization space](https://huggingface.co/spaces/medmekk/TorchAO_Quantization) for quantizing models with a simple UI.
The `reasoning_content` field contains the reasoning steps that led to the final conclusion, while the `content` field contains the final conclusion.
...
@@ -68,164 +70,125 @@ The `reasoning_content` field contains the reasoning steps that led to the final
Streaming chat completions are also supported for reasoning models. The `reasoning_content` field is available in the `delta` field in [chat completion response chunks](https://platform.openai.com/docs/api-reference/chat/streaming).
The OpenAI Python client library does not officially support the `reasoning_content` attribute for streaming output, but the client does support extra attributes in the response. You can use `hasattr` to check whether the `reasoning_content` attribute is present in the response. For example:
??? Code

    ```python
    from openai import OpenAI

    # Modify OpenAI's API key and API base to use vLLM's API server.
    openai_api_key = "EMPTY"
    openai_api_base = "http://localhost:8000/v1"

    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )

    models = client.models.list()
    model = models.data[0].id

    messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]
    # For granite, add: `extra_body={"chat_template_kwargs": {"thinking": True}}`
    # For Qwen3 series, if you want to disable thinking in reasoning mode, add:
Remember to check whether `reasoning_content` exists in the response before accessing it. You can check out the [example](https://github.com/vllm-project/vllm/blob/main/examples/online_serving/openai_chat_completion_with_reasoning_streaming.py).
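A minimal sketch of that check, assuming each streamed chunk exposes a `delta` object as in the OpenAI client. The helper name `split_delta` and the `SimpleNamespace` stand-ins are illustrative only and are not part of vLLM or the OpenAI library:

```python
from types import SimpleNamespace


def split_delta(delta):
    """Return (reasoning_text, content_text) for one streamed delta.

    `reasoning_content` is an extra attribute, so it may be missing;
    getattr with a default avoids an AttributeError in that case.
    """
    reasoning = getattr(delta, "reasoning_content", None) or ""
    content = getattr(delta, "content", None) or ""
    return reasoning, content


# Stand-ins for the `chunk.choices[0].delta` objects of a streaming response.
deltas = [
    SimpleNamespace(reasoning_content="9.8 is 9.80, ", content=None),
    SimpleNamespace(reasoning_content="and 9.80 > 9.11.", content=None),
    SimpleNamespace(content="9.8 is greater."),  # no reasoning_content attribute
]

reasoning_parts, content_parts = [], []
for delta in deltas:
    reasoning, content = split_delta(delta)
    reasoning_parts.append(reasoning)
    content_parts.append(content)

full_reasoning = "".join(reasoning_parts)
full_content = "".join(content_parts)
print("reasoning:", full_reasoning)
print("content:", full_content)
```

Using `getattr` with a default covers both cases that `hasattr` would otherwise need to separate: the attribute being absent and the attribute being present but `None`.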
## Structured output
The reasoning content is also available in structured output. A structured output engine such as `xgrammar` uses the reasoning content to generate structured output. This is currently supported only in the v0 engine.
The reasoning content is also available when both tool calling and the reasoning parser are enabled. Additionally, tool calling only parses functions from the `content` field, not from the `reasoning_content`.
def extract_reasoning_content(
    """
    Extract reasoning content from a complete model-generated string.

    Used for non-streaming responses where we have the entire model response
    available before sending to the client.

    Parameters:
    model_output: str
        The model-generated string to extract reasoning content from.

    request: ChatCompletionRequest
        The request object that was used to generate the model_output.

    Returns:
    tuple[Optional[str], Optional[str]]
        A tuple containing the reasoning content and the content.
    """
```
Additionally, to enable structured output, you'll need to create a new `Reasoner` similar to the one in <gh-file:vllm/reasoning/deepseek_r1_reasoning_parser.py>.
??? Code

    ```python
    @dataclass
    class DeepSeekReasoner(Reasoner):
        """
        Reasoner for DeepSeek R series models.
        """
The structured output engine like [xgrammar](https://github.com/mlc-ai/xgrammar) will use `end_token_id` to check if the reasoning content is present in the model output and skip the structured output if it is the case.
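As a rough illustration of that check, the sketch below treats the reasoning section as finished once `end_token_id` has appeared in the generated token IDs, and applies structured-output constraints only from that point on. The function names, the token ID values, and the decision logic are assumptions made for illustration; they do not reproduce the actual xgrammar or vLLM implementation:

```python
def reasoning_finished(token_ids, end_token_id):
    """Return True once the end-of-reasoning token has been generated."""
    return end_token_id in token_ids


def should_apply_grammar(token_ids, end_token_id):
    """Apply structured-output constraints only after reasoning has ended."""
    return reasoning_finished(token_ids, end_token_id)


END_TOKEN_ID = 7  # hypothetical ID of the end-of-reasoning token

still_reasoning = [3, 11, 42]
done_reasoning = [3, 11, 42, END_TOKEN_ID, 5]

print(should_apply_grammar(still_reasoning, END_TOKEN_ID))
print(should_apply_grammar(done_reasoning, END_TOKEN_ID))
```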
Note that these speculative models currently need to be run without tensor parallelism, although
it is possible to run the main model using tensor parallelism (see example above). Since the
...
@@ -177,31 +185,34 @@ A variety of speculative models of this type are available on HF hub:
The following code configures vLLM to use speculative decoding where proposals are generated by
an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found [here](gh-file:examples/offline_inference/eagle.py).
The next example shows how to use the `guided_regex`. The idea is to generate an email address, given a simple regex template:
??? Code

    ```python
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": "Generate an example email address for Alan Turing, who works in Enigma. End in .com and new line. Example result: alan.turing@enigma.com\n",
            }
        ],
@@ -97,6 +99,14 @@ vLLM supports the `tool_choice='required'` option in the chat completion API. Si
When `tool_choice='required'` is set, the model is guaranteed to generate one or more tool calls based on the specified tool list in the `tools` parameter. The number of tool calls depends on the user's query. The output format strictly follows the schema defined in the `tools` parameter.
## None Function Calling
vLLM supports the `tool_choice='none'` option in the chat completion API. When this option is set, the model will not generate any tool calls and will respond with regular text content only, even if tools are defined in the request.
By default, when `tool_choice='none'` is specified, vLLM excludes tool definitions from the prompt to optimize context usage. To include tool definitions even with `tool_choice='none'`, use the `--expand-tools-even-if-tool-choice-none` option.
Note: This behavior will change in v0.10.0, where tool definitions will be included by default even with `tool_choice='none'`.
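
For illustration, a request that defines a tool but sets `tool_choice` to `"none"` could look like the sketch below. The model placeholder, the `get_weather` tool, and its schema are hypothetical and not taken from vLLM; the payload follows the OpenAI chat completion request shape:

```python
import json

# Hypothetical tool definition in the OpenAI function-calling schema.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

payload = {
    "model": "<model_id>",  # placeholder, replace with the served model name
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": tools,
    # Tools are defined, but the model must answer with regular text only.
    "tool_choice": "none",
}

# JSON body for a POST to the server's /v1/chat/completions endpoint.
body = json.dumps(payload, indent=2)
print(body)
```

With the default behavior described above, the tool definitions in this request would be left out of the prompt unless the server was started with `--expand-tools-even-if-tool-choice-none`.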
## Automatic Function Calling
To enable this feature, you should set the following flags:
...
@@ -226,6 +236,25 @@ AI21's Jamba-1.5 models are supported.
Flags: `--tool-call-parser jamba`
### xLAM Models (`xlam`)
The xLAM tool parser is designed to support models that generate tool calls in various JSON formats. It detects function calls in several different output styles:
1. Direct JSON arrays: Output strings that are JSON arrays starting with `[` and ending with `]`
2. Thinking tags: Using `<think>...</think>` tags containing JSON arrays
3. Code blocks: JSON in code blocks (```json ...```)
4. Tool calls tags: Using `[TOOL_CALLS]` or `<tool_call>...</tool_call>` tags
Parallel function calls are supported, and the parser can effectively separate text content from tool calls.
* For Llama-based xLAM models: `--tool-call-parser xlam --chat-template examples/tool_chat_template_xlam_llama.jinja`
* For Qwen-based xLAM models: `--tool-call-parser xlam --chat-template examples/tool_chat_template_xlam_qwen.jinja`
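
The four output styles listed in this section can be sketched with a small stand-alone detector. This is an illustrative approximation only: the function names, the regular expressions, and the order in which the styles are tried are assumptions, and vLLM's actual `xlam` parser may behave differently:

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence marker, built here for readability

# Wrapped styles, tried in order; each captures the text that should be a JSON array.
_WRAPPED_PATTERNS = [
    re.compile(r"<think>(.*?)</think>", re.DOTALL),          # thinking tags
    re.compile(re.escape(FENCE) + r"json\s*(.*?)\s*" + re.escape(FENCE), re.DOTALL),  # code block
    re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL),  # tool call tags
    re.compile(r"\[TOOL_CALLS\]\s*(.*)", re.DOTALL),         # tool calls marker
]


def _load_calls(text):
    """Return the parsed list if `text` is a JSON array, else None."""
    text = text.strip()
    if not (text.startswith("[") and text.endswith("]")):
        return None
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, list) else None


def extract_tool_calls(output):
    """Return (remaining_text, tool_calls) for one model output string."""
    # Style 1: the whole output is a JSON array.
    calls = _load_calls(output)
    if calls is not None:
        return "", calls
    # Styles 2-4: a JSON array wrapped in tags or a code block.
    for pattern in _WRAPPED_PATTERNS:
        match = pattern.search(output)
        if match:
            calls = _load_calls(match.group(1))
            if calls is not None:
                remaining = (output[: match.start()] + output[match.end():]).strip()
                return remaining, calls
    # No tool call found: everything is regular text content.
    return output, []


calls_json = '[{"name": "get_weather", "arguments": {"city": "Paris"}}]'
samples = [
    calls_json,
    "Let me check. <tool_call>" + calls_json + "</tool_call>",
    "Here you go:\n" + FENCE + "json\n" + calls_json + "\n" + FENCE,
    "[TOOL_CALLS] " + calls_json,
    "Hello, how can I help?",
]
for sample in samples:
    print(extract_tool_calls(sample))
```

Because each style yields a JSON array, parallel function calls simply appear as multiple entries in the returned list, and any text outside the matched span is returned separately as content.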
### Qwen Models
For Qwen2.5, the chat template in tokenizer_config.json has already included support for the Hermes-style tool use. Therefore, you can use the `hermes` parser to enable tool calls for Qwen models. For more detailed information, please refer to the official [Qwen documentation](https://qwen.readthedocs.io/en/latest/framework/function_call.html#vllm)
...
@@ -235,6 +264,15 @@ For Qwen2.5, the chat template in tokenizer_config.json has already included sup
Flags: `--tool-call-parser hermes`
### MiniMax Models (`minimax_m1`)
Supported models:
* `MiniMaxAi/MiniMax-M1-40k` (use with <gh-file:examples/tool_chat_template_minimax.jinja>)
* `MiniMaxAi/MiniMax-M1-80k` (use with <gh-file:examples/tool_chat_template_minimax.jinja>)