Commit a40a133c authored by zhuwenwen

Merge tag 'v0.9.2' into v0.9.2-dev

parents 1a9a61d7 a5dd03c1
...@@ -7,16 +7,16 @@ Quantization trades off model precision for smaller memory footprint, allowing l ...@@ -7,16 +7,16 @@ Quantization trades off model precision for smaller memory footprint, allowing l
Contents: Contents:
- [Supported_Hardware](supported_hardware.md) - [Supported Hardware](supported_hardware.md)
- [Auto_Awq](auto_awq.md) - [AutoAWQ](auto_awq.md)
- [Bnb](bnb.md) - [BitsAndBytes](bnb.md)
- [Bitblas](bitblas.md) - [BitBLAS](bitblas.md)
- [Gguf](gguf.md) - [GGUF](gguf.md)
- [Gptqmodel](gptqmodel.md) - [GPTQModel](gptqmodel.md)
- [Int4](int4.md) - [INT4 W4A16](int4.md)
- [Int8](int8.md) - [INT8 W8A8](int8.md)
- [Fp8](fp8.md) - [FP8 W8A8](fp8.md)
- [Modelopt](modelopt.md) - [NVIDIA TensorRT Model Optimizer](modelopt.md)
- [Quark](quark.md) - [AMD Quark](quark.md)
- [Quantized_Kvcache](quantized_kvcache.md) - [Quantized KV Cache](quantized_kvcache.md)
- [Torchao](torchao.md) - [TorchAO](torchao.md)
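
As a rough illustration of the memory trade-off described above, the sketch below estimates weight storage at different bit widths. The helper `weight_memory_gib` and the 7B parameter count are hypothetical and chosen only for illustration; the estimate ignores scales, zero points, activations, and the KV cache.

```python
def weight_memory_gib(num_params: int, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB, ignoring quantization metadata."""
    return num_params * bits_per_weight / 8 / 2**30


# Hypothetical 7B-parameter model at a few common precisions.
for name, bits in [("FP16", 16), ("INT8 / FP8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gib(7_000_000_000, bits):.2f} GiB")
```

Halving the bits per weight halves the weight footprint, which is where most of the memory savings of weight-only schemes come from.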
...@@ -9,39 +9,41 @@ The main benefits are lower latency and memory usage. ...@@ -9,39 +9,41 @@ The main benefits are lower latency and memory usage.
You can quantize your own models by installing AutoAWQ or picking one of the [6500+ models on Huggingface](https://huggingface.co/models?search=awq). You can quantize your own models by installing AutoAWQ or picking one of the [6500+ models on Huggingface](https://huggingface.co/models?search=awq).
```console ```bash
pip install autoawq pip install autoawq
``` ```
After installing AutoAWQ, you are ready to quantize a model. Please refer to the [AutoAWQ documentation](https://casper-hansen.github.io/AutoAWQ/examples/#basic-quantization) for further details. Here is an example of how to quantize `mistralai/Mistral-7B-Instruct-v0.2`: After installing AutoAWQ, you are ready to quantize a model. Please refer to the [AutoAWQ documentation](https://casper-hansen.github.io/AutoAWQ/examples/#basic-quantization) for further details. Here is an example of how to quantize `mistralai/Mistral-7B-Instruct-v0.2`:
```python ??? Code
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = 'mistralai/Mistral-7B-Instruct-v0.2' ```python
quant_path = 'mistral-instruct-v0.2-awq' from awq import AutoAWQForCausalLM
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" } from transformers import AutoTokenizer
# Load model model_path = 'mistralai/Mistral-7B-Instruct-v0.2'
model = AutoAWQForCausalLM.from_pretrained( quant_path = 'mistral-instruct-v0.2-awq'
model_path, **{"low_cpu_mem_usage": True, "use_cache": False} quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Quantize # Load model
model.quantize(tokenizer, quant_config=quant_config) model = AutoAWQForCausalLM.from_pretrained(
model_path, **{"low_cpu_mem_usage": True, "use_cache": False}
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Save quantized model # Quantize
model.save_quantized(quant_path) model.quantize(tokenizer, quant_config=quant_config)
tokenizer.save_pretrained(quant_path)
print(f'Model is quantized and saved at "{quant_path}"') # Save quantized model
``` model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print(f'Model is quantized and saved at "{quant_path}"')
```
To run an AWQ model with vLLM, you can use [TheBloke/Llama-2-7b-Chat-AWQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ) with the following command: To run an AWQ model with vLLM, you can use [TheBloke/Llama-2-7b-Chat-AWQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ) with the following command:
```console ```bash
python examples/offline_inference/llm_engine_example.py \ python examples/offline_inference/llm_engine_example.py \
--model TheBloke/Llama-2-7b-Chat-AWQ \ --model TheBloke/Llama-2-7b-Chat-AWQ \
--quantization awq --quantization awq
...@@ -49,27 +51,29 @@ python examples/offline_inference/llm_engine_example.py \ ...@@ -49,27 +51,29 @@ python examples/offline_inference/llm_engine_example.py \
AWQ models are also supported directly through the LLM entrypoint: AWQ models are also supported directly through the LLM entrypoint:
```python ??? Code
from vllm import LLM, SamplingParams
```python
# Sample prompts. from vllm import LLM, SamplingParams
prompts = [
"Hello, my name is", # Sample prompts.
"The president of the United States is", prompts = [
"The capital of France is", "Hello, my name is",
"The future of AI is", "The president of the United States is",
] "The capital of France is",
# Create a sampling params object. "The future of AI is",
sampling_params = SamplingParams(temperature=0.8, top_p=0.95) ]
# Create a sampling params object.
# Create an LLM. sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="AWQ")
# Generate texts from the prompts. The output is a list of RequestOutput objects # Create an LLM.
# that contain the prompt, generated text, and other information. llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="AWQ")
outputs = llm.generate(prompts, sampling_params) # Generate texts from the prompts. The output is a list of RequestOutput objects
# Print the outputs. # that contain the prompt, generated text, and other information.
for output in outputs: outputs = llm.generate(prompts, sampling_params)
prompt = output.prompt # Print the outputs.
generated_text = output.outputs[0].text for output in outputs:
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") prompt = output.prompt
``` generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
...@@ -12,7 +12,7 @@ vLLM now supports [BitBLAS](https://github.com/microsoft/BitBLAS) for more effic ...@@ -12,7 +12,7 @@ vLLM now supports [BitBLAS](https://github.com/microsoft/BitBLAS) for more effic
Below are the steps to utilize BitBLAS with vLLM. Below are the steps to utilize BitBLAS with vLLM.
```console ```bash
pip install "bitblas>=0.1.0" pip install "bitblas>=0.1.0"
``` ```
...@@ -43,17 +43,19 @@ llm = LLM( ...@@ -43,17 +43,19 @@ llm = LLM(
## Read gptq format checkpoint ## Read gptq format checkpoint
```python ??? Code
from vllm import LLM
import torch ```python
from vllm import LLM
# "hxbgsyxh/llama-13b-4bit-g-1" is a pre-quantized checkpoint. import torch
model_id = "hxbgsyxh/llama-13b-4bit-g-1"
llm = LLM( # "hxbgsyxh/llama-13b-4bit-g-1" is a pre-quantized checkpoint.
model=model_id, model_id = "hxbgsyxh/llama-13b-4bit-g-1"
dtype=torch.float16, llm = LLM(
trust_remote_code=True, model=model_id,
quantization="bitblas", dtype=torch.float16,
max_model_len=1024 trust_remote_code=True,
) quantization="bitblas",
``` max_model_len=1024
)
```
...@@ -9,8 +9,8 @@ Compared to other quantization methods, BitsAndBytes eliminates the need for cal ...@@ -9,8 +9,8 @@ Compared to other quantization methods, BitsAndBytes eliminates the need for cal
Below are the steps to utilize BitsAndBytes with vLLM. Below are the steps to utilize BitsAndBytes with vLLM.
```console ```bash
pip install "bitsandbytes>=0.45.3" pip install "bitsandbytes>=0.46.1"
``` ```
vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoint. vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoint.
...@@ -54,6 +54,6 @@ llm = LLM( ...@@ -54,6 +54,6 @@ llm = LLM(
Append the following to your model arguments for 4bit inflight quantization: Append the following to your model arguments for 4bit inflight quantization:
```console ```bash
--quantization bitsandbytes --quantization bitsandbytes
``` ```
...@@ -23,7 +23,7 @@ The FP8 types typically supported in hardware have two distinct representations, ...@@ -23,7 +23,7 @@ The FP8 types typically supported in hardware have two distinct representations,
To produce performant FP8 quantized models with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library: To produce performant FP8 quantized models with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
```console ```bash
pip install llmcompressor pip install llmcompressor
``` ```
...@@ -58,28 +58,30 @@ For FP8 quantization, we can recover accuracy with simple RTN quantization. We r ...@@ -58,28 +58,30 @@ For FP8 quantization, we can recover accuracy with simple RTN quantization. We r
Since simple RTN does not require data for weight quantization and the activations are quantized dynamically, we do not need any calibration data for this quantization flow. Since simple RTN does not require data for weight quantization and the activations are quantized dynamically, we do not need any calibration data for this quantization flow.
```python ??? Code
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
# Configure the simple PTQ quantization ```python
recipe = QuantizationModifier( from llmcompressor.transformers import oneshot
targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]) from llmcompressor.modifiers.quantization import QuantizationModifier
# Apply the quantization algorithm. # Configure the simple PTQ quantization
oneshot(model=model, recipe=recipe) recipe = QuantizationModifier(
targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
# Save the model: Meta-Llama-3-8B-Instruct-FP8-Dynamic # Apply the quantization algorithm.
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic" oneshot(model=model, recipe=recipe)
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR) # Save the model: Meta-Llama-3-8B-Instruct-FP8-Dynamic
``` SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```
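
To make the "simple RTN" idea concrete, here is a minimal pure-Python sketch of round-to-nearest onto an FP8 E4M3 grid with a single scale. It is an illustration only, not llm-compressor's implementation: the function names are hypothetical, the per-tensor scale is an assumption (the actual scheme may use finer-grained scales), and the rounding is simulated in software rather than done by a hardware cast.

```python
import math

FP8_E4M3_MAX = 448.0   # largest finite value in the E4M3 format
MIN_NORMAL_EXP = -6    # smallest normal exponent in E4M3
MANTISSA_BITS = 3


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest value on the E4M3 grid, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)
    _, e = math.frexp(mag)                # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, MIN_NORMAL_EXP)      # subnormals share the smallest exponent
    step = 2.0 ** (exp - MANTISSA_BITS)   # spacing between neighbouring grid values
    return sign * min(round(mag / step) * step, FP8_E4M3_MAX)


def quantize_rtn_fp8(weights):
    """Scale weights so the largest magnitude maps to 448, then round to nearest."""
    amax = max(abs(w) for w in weights)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(w / scale) for w in weights], scale


weights = [0.02, -0.5, 0.31, 1.2]
q, scale = quantize_rtn_fp8(weights)
dequantized = [v * scale for v in q]
print(dequantized)
```

No calibration data appears anywhere in this flow: the scale depends only on the weights themselves, which is why the recipe above needs no dataset.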
### 3. Evaluating Accuracy ### 3. Evaluating Accuracy
Install `vllm` and `lm-evaluation-harness` for evaluation: Install `vllm` and `lm-evaluation-harness` for evaluation:
```console ```bash
pip install vllm lm-eval==0.4.4 pip install vllm lm-eval==0.4.4
``` ```
...@@ -97,9 +99,9 @@ Evaluate accuracy with `lm_eval` (for example on 250 samples of `gsm8k`): ...@@ -97,9 +99,9 @@ Evaluate accuracy with `lm_eval` (for example on 250 samples of `gsm8k`):
!!! note !!! note
Quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations. Quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations.
```console ```bash
$ MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic
$ lm_eval \ lm_eval \
--model vllm \ --model vllm \
--model_args pretrained=$MODEL,add_bos_token=True \ --model_args pretrained=$MODEL,add_bos_token=True \
--tasks gsm8k --num_fewshot 5 --batch_size auto --limit 250 --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 250
......
...@@ -11,7 +11,7 @@ title: GGUF ...@@ -11,7 +11,7 @@ title: GGUF
To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command: To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command:
```console ```bash
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
# We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion. # We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
...@@ -20,7 +20,7 @@ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \ ...@@ -20,7 +20,7 @@ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
You can also add `--tensor-parallel-size 2` to enable tensor parallelism inference with 2 GPUs: You can also add `--tensor-parallel-size 2` to enable tensor parallelism inference with 2 GPUs:
```console ```bash
# We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion. # We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
--tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 \ --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
...@@ -32,7 +32,7 @@ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \ ...@@ -32,7 +32,7 @@ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
GGUF assumes that huggingface can convert the metadata to a config file. In case huggingface doesn't support your model you can manually create a config and pass it as hf-config-path GGUF assumes that huggingface can convert the metadata to a config file. In case huggingface doesn't support your model you can manually create a config and pass it as hf-config-path
```console ```bash
# If your model is not supported by huggingface you can manually provide a huggingface compatible config path # If your model is not supported by huggingface you can manually provide a huggingface compatible config path
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
--tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 \ --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
...@@ -41,42 +41,44 @@ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \ ...@@ -41,42 +41,44 @@ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
You can also use the GGUF model directly through the LLM entrypoint: You can also use the GGUF model directly through the LLM entrypoint:
```python ??? Code
from vllm import LLM, SamplingParams
```python
# In this script, we demonstrate how to pass input to the chat method: from vllm import LLM, SamplingParams
conversation = [
{ # In this script, we demonstrate how to pass input to the chat method:
"role": "system", conversation = [
"content": "You are a helpful assistant" {
}, "role": "system",
{ "content": "You are a helpful assistant"
"role": "user", },
"content": "Hello" {
}, "role": "user",
{ "content": "Hello"
"role": "assistant", },
"content": "Hello! How can I assist you today?" {
}, "role": "assistant",
{ "content": "Hello! How can I assist you today?"
"role": "user", },
"content": "Write an essay about the importance of higher education.", {
}, "role": "user",
] "content": "Write an essay about the importance of higher education.",
},
# Create a sampling params object. ]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create a sampling params object.
# Create an LLM. sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0") # Create an LLM.
# Generate texts from the prompts. The output is a list of RequestOutput objects llm = LLM(model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
# that contain the prompt, generated text, and other information. tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
outputs = llm.chat(conversation, sampling_params) # Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
# Print the outputs. outputs = llm.chat(conversation, sampling_params)
for output in outputs:
prompt = output.prompt # Print the outputs.
generated_text = output.outputs[0].text for output in outputs:
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") prompt = output.prompt
``` generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
...@@ -21,7 +21,7 @@ for more details on this and other advanced features. ...@@ -21,7 +21,7 @@ for more details on this and other advanced features.
You can quantize your own models by installing [GPTQModel](https://github.com/ModelCloud/GPTQModel) or picking one of the [5000+ models on Huggingface](https://huggingface.co/models?search=gptq). You can quantize your own models by installing [GPTQModel](https://github.com/ModelCloud/GPTQModel) or picking one of the [5000+ models on Huggingface](https://huggingface.co/models?search=gptq).
```console ```bash
pip install -U gptqmodel --no-build-isolation -v pip install -U gptqmodel --no-build-isolation -v
``` ```
...@@ -31,34 +31,36 @@ After installing GPTQModel, you are ready to quantize a model. Please refer to t ...@@ -31,34 +31,36 @@ After installing GPTQModel, you are ready to quantize a model. Please refer to t
Here is an example of how to quantize `meta-llama/Llama-3.2-1B-Instruct`: Here is an example of how to quantize `meta-llama/Llama-3.2-1B-Instruct`:
```python ??? Code
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig
model_id = "meta-llama/Llama-3.2-1B-Instruct" ```python
quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit" from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig
calibration_dataset = load_dataset( model_id = "meta-llama/Llama-3.2-1B-Instruct"
"allenai/c4", quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit"
data_files="en/c4-train.00001-of-01024.json.gz",
split="train"
).select(range(1024))["text"]
quant_config = QuantizeConfig(bits=4, group_size=128) calibration_dataset = load_dataset(
"allenai/c4",
data_files="en/c4-train.00001-of-01024.json.gz",
split="train"
).select(range(1024))["text"]
model = GPTQModel.load(model_id, quant_config) quant_config = QuantizeConfig(bits=4, group_size=128)
# increase `batch_size` to match gpu/vram specs to speed up quantization model = GPTQModel.load(model_id, quant_config)
model.quantize(calibration_dataset, batch_size=2)
model.save(quant_path) # increase `batch_size` to match gpu/vram specs to speed up quantization
``` model.quantize(calibration_dataset, batch_size=2)
model.save(quant_path)
```
## Running a quantized model with vLLM ## Running a quantized model with vLLM
To run a GPTQModel quantized model with vLLM, you can use [DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2](https://huggingface.co/ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2) with the following command: To run a GPTQModel quantized model with vLLM, you can use [DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2](https://huggingface.co/ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2) with the following command:
```console ```bash
python examples/offline_inference/llm_engine_example.py \ python examples/offline_inference/llm_engine_example.py \
--model ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2 --model ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2
``` ```
...@@ -67,32 +69,34 @@ python examples/offline_inference/llm_engine_example.py \ ...@@ -67,32 +69,34 @@ python examples/offline_inference/llm_engine_example.py \
GPTQModel quantized models are also supported directly through the LLM entrypoint: GPTQModel quantized models are also supported directly through the LLM entrypoint:
```python ??? Code
from vllm import LLM, SamplingParams
```python
# Sample prompts. from vllm import LLM, SamplingParams
prompts = [
"Hello, my name is", # Sample prompts.
"The president of the United States is", prompts = [
"The capital of France is", "Hello, my name is",
"The future of AI is", "The president of the United States is",
] "The capital of France is",
"The future of AI is",
# Create a sampling params object. ]
sampling_params = SamplingParams(temperature=0.6, top_p=0.9)
# Create a sampling params object.
# Create an LLM. sampling_params = SamplingParams(temperature=0.6, top_p=0.9)
llm = LLM(model="ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2")
# Create an LLM.
# Generate texts from the prompts. The output is a list of RequestOutput objects llm = LLM(model="ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2")
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params) # Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
# Print the outputs. outputs = llm.generate(prompts, sampling_params)
print("-"*50)
for output in outputs: # Print the outputs.
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}\nGenerated text: {generated_text!r}")
print("-"*50) print("-"*50)
``` for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}\nGenerated text: {generated_text!r}")
print("-"*50)
```
...@@ -14,13 +14,13 @@ Please visit the HF collection of [quantized INT4 checkpoints of popular LLMs re ...@@ -14,13 +14,13 @@ Please visit the HF collection of [quantized INT4 checkpoints of popular LLMs re
To use INT4 quantization with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library: To use INT4 quantization with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
```console ```bash
pip install llmcompressor pip install llmcompressor
``` ```
Additionally, install `vllm` and `lm-evaluation-harness` for evaluation: Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:
```console ```bash
pip install vllm lm-eval==0.4.4 pip install vllm lm-eval==0.4.4
``` ```
...@@ -53,51 +53,55 @@ When quantizing weights to INT4, you need sample data to estimate the weight upd ...@@ -53,51 +53,55 @@ When quantizing weights to INT4, you need sample data to estimate the weight upd
It's best to use calibration data that closely matches your deployment data. It's best to use calibration data that closely matches your deployment data.
For a general-purpose instruction-tuned model, you can use a dataset like `ultrachat`: For a general-purpose instruction-tuned model, you can use a dataset like `ultrachat`:
```python ??? Code
from datasets import load_dataset
NUM_CALIBRATION_SAMPLES = 512 ```python
MAX_SEQUENCE_LENGTH = 2048 from datasets import load_dataset
# Load and preprocess the dataset NUM_CALIBRATION_SAMPLES = 512
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft") MAX_SEQUENCE_LENGTH = 2048
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
def preprocess(example): # Load and preprocess the dataset
return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)} ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.map(preprocess) ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
def tokenize(sample): def preprocess(example):
return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False) return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
ds = ds.map(tokenize, remove_columns=ds.column_names) ds = ds.map(preprocess)
```
def tokenize(sample):
return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
ds = ds.map(tokenize, remove_columns=ds.column_names)
```
### 3. Applying Quantization ### 3. Applying Quantization
Now, apply the quantization algorithms: Now, apply the quantization algorithms:
```python ??? Code
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
# Configure the quantization algorithms
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
# Apply quantization
oneshot(
model=model,
dataset=ds,
recipe=recipe,
max_seq_length=MAX_SEQUENCE_LENGTH,
num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)
# Save the compressed model: Meta-Llama-3-8B-Instruct-W4A16-G128 ```python
SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A16-G128" from llmcompressor.transformers import oneshot
model.save_pretrained(SAVE_DIR, save_compressed=True) from llmcompressor.modifiers.quantization import GPTQModifier
tokenizer.save_pretrained(SAVE_DIR) from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
```
# Configure the quantization algorithms
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
# Apply quantization
oneshot(
model=model,
dataset=ds,
recipe=recipe,
max_seq_length=MAX_SEQUENCE_LENGTH,
num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)
# Save the compressed model: Meta-Llama-3-8B-Instruct-W4A16-G128
SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A16-G128"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```
This process creates a W4A16 model with weights quantized to 4-bit integers. This process creates a W4A16 model with weights quantized to 4-bit integers.
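
The `G128` suffix refers to group-wise quantization: each run of 128 consecutive weights shares one scale. The sketch below shows that storage layout in pure Python. It uses plain round-to-nearest rather than GPTQ's error-compensating updates, assumes a symmetric range of [-7, 7], and uses hypothetical function names, so treat it as a picture of the format rather than of the algorithm llm-compressor runs.

```python
def quantize_groupwise_int4(weights, group_size=128):
    """Symmetric 4-bit quantization with one scale per group of weights."""
    qmax = 7  # assumed symmetric signed 4-bit range: [-7, 7]
    q, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        amax = max(abs(w) for w in group)
        scale = amax / qmax if amax > 0 else 1.0
        scales.append(scale)
        q.extend(max(-qmax, min(qmax, round(w / scale))) for w in group)
    return q, scales


def dequantize_groupwise(q, scales, group_size=128):
    """Recover approximate weights by multiplying each code by its group's scale."""
    return [v * scales[i // group_size] for i, v in enumerate(q)]


# Tiny example with group_size=4 so the two groups are easy to inspect.
weights = [0.1, -0.2, 0.05, 0.4, 10.0, -5.0, 2.5, 0.0]
q, scales = quantize_groupwise_int4(weights, group_size=4)
restored = dequantize_groupwise(q, scales, group_size=4)
print(q, scales, restored)
```

Because the second group has its own scale, its large values do not coarsen the grid used for the small values in the first group, which is the motivation for smaller group sizes.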
...@@ -112,8 +116,8 @@ model = LLM("./Meta-Llama-3-8B-Instruct-W4A16-G128") ...@@ -112,8 +116,8 @@ model = LLM("./Meta-Llama-3-8B-Instruct-W4A16-G128")
To evaluate accuracy, you can use `lm_eval`: To evaluate accuracy, you can use `lm_eval`:
```console ```bash
$ lm_eval --model vllm \ lm_eval --model vllm \
--model_args pretrained="./Meta-Llama-3-8B-Instruct-W4A16-G128",add_bos_token=true \ --model_args pretrained="./Meta-Llama-3-8B-Instruct-W4A16-G128",add_bos_token=true \
--tasks gsm8k \ --tasks gsm8k \
--num_fewshot 5 \ --num_fewshot 5 \
...@@ -137,34 +141,36 @@ $ lm_eval --model vllm \ ...@@ -137,34 +141,36 @@ $ lm_eval --model vllm \
The following is an example of an expanded quantization recipe you can tune to your own use case: The following is an example of an expanded quantization recipe you can tune to your own use case:
```python ??? Code
from compressed_tensors.quantization import (
QuantizationArgs, ```python
QuantizationScheme, from compressed_tensors.quantization import (
QuantizationStrategy, QuantizationArgs,
QuantizationType, QuantizationScheme,
) QuantizationStrategy,
recipe = GPTQModifier( QuantizationType,
targets="Linear", )
config_groups={ recipe = GPTQModifier(
"config_group": QuantizationScheme( targets="Linear",
targets=["Linear"], config_groups={
weights=QuantizationArgs( "config_group": QuantizationScheme(
num_bits=4, targets=["Linear"],
type=QuantizationType.INT, weights=QuantizationArgs(
strategy=QuantizationStrategy.GROUP, num_bits=4,
group_size=128, type=QuantizationType.INT,
symmetric=True, strategy=QuantizationStrategy.GROUP,
dynamic=False, group_size=128,
actorder="weight", symmetric=True,
dynamic=False,
actorder="weight",
),
), ),
), },
}, ignore=["lm_head"],
ignore=["lm_head"], update_size=NUM_CALIBRATION_SAMPLES,
update_size=NUM_CALIBRATION_SAMPLES, dampening_frac=0.01
dampening_frac=0.01 )
) ```
```
## Troubleshooting and Support ## Troubleshooting and Support
......
...@@ -15,13 +15,13 @@ Please visit the HF collection of [quantized INT8 checkpoints of popular LLMs re ...@@ -15,13 +15,13 @@ Please visit the HF collection of [quantized INT8 checkpoints of popular LLMs re
To use INT8 quantization with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library: To use INT8 quantization with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
```console ```bash
pip install llmcompressor pip install llmcompressor
``` ```
Additionally, install `vllm` and `lm-evaluation-harness` for evaluation: Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:
```console ```bash
pip install vllm lm-eval==0.4.4 pip install vllm lm-eval==0.4.4
``` ```
...@@ -54,54 +54,60 @@ When quantizing activations to INT8, you need sample data to estimate the activa ...@@ -54,54 +54,60 @@ When quantizing activations to INT8, you need sample data to estimate the activa
It's best to use calibration data that closely matches your deployment data. It's best to use calibration data that closely matches your deployment data.
For a general-purpose instruction-tuned model, you can use a dataset like `ultrachat`: For a general-purpose instruction-tuned model, you can use a dataset like `ultrachat`:
```python ??? Code
from datasets import load_dataset
NUM_CALIBRATION_SAMPLES = 512 ```python
MAX_SEQUENCE_LENGTH = 2048 from datasets import load_dataset
# Load and preprocess the dataset NUM_CALIBRATION_SAMPLES = 512
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft") MAX_SEQUENCE_LENGTH = 2048
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
def preprocess(example): # Load and preprocess the dataset
return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)} ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.map(preprocess) ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
def tokenize(sample): def preprocess(example):
return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False) return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
ds = ds.map(tokenize, remove_columns=ds.column_names) ds = ds.map(preprocess)
```
def tokenize(sample):
return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
ds = ds.map(tokenize, remove_columns=ds.column_names)
```
</details>
### 3. Applying Quantization ### 3. Applying Quantization
Now, apply the quantization algorithms: Now, apply the quantization algorithms:
```python ??? Code
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier ```python
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
# Configure the quantization algorithms from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
recipe = [
SmoothQuantModifier(smoothing_strength=0.8), # Configure the quantization algorithms
GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]), recipe = [
] SmoothQuantModifier(smoothing_strength=0.8),
GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
# Apply quantization ]
oneshot(
model=model, # Apply quantization
dataset=ds, oneshot(
recipe=recipe, model=model,
max_seq_length=MAX_SEQUENCE_LENGTH, dataset=ds,
num_calibration_samples=NUM_CALIBRATION_SAMPLES, recipe=recipe,
) max_seq_length=MAX_SEQUENCE_LENGTH,
num_calibration_samples=NUM_CALIBRATION_SAMPLES,
# Save the compressed model: Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token )
SAVE_DIR = MODEL_ID.split("/")[1] + "-W8A8-Dynamic-Per-Token"
model.save_pretrained(SAVE_DIR, save_compressed=True) # Save the compressed model: Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token
tokenizer.save_pretrained(SAVE_DIR) SAVE_DIR = MODEL_ID.split("/")[1] + "-W8A8-Dynamic-Per-Token"
``` model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```
This process creates a W8A8 model with weights and activations quantized to 8-bit integers. This process creates a W8A8 model with weights and activations quantized to 8-bit integers.
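
The `SmoothQuantModifier(smoothing_strength=0.8)` step in the recipe can be pictured with the per-channel scaling from the SmoothQuant method: activations are divided by a scale and the matching weight rows are multiplied by it, so the layer output is unchanged while activation outliers shrink. The sketch below assumes `smoothing_strength` plays the role of the exponent `alpha` in that formula; the helper names and the toy numbers are hypothetical, and this is not llm-compressor's code.

```python
def smoothing_scales(act_absmax, weight_absmax, alpha=0.8):
    """Per-input-channel scale s_j = max|X_j|**alpha / max|W_j|**(1 - alpha)."""
    return [(a ** alpha) / (w ** (1 - alpha))
            for a, w in zip(act_absmax, weight_absmax)]


def matvec(x, W):
    """Compute x @ W for a vector x of length `in` and W of shape [in][out]."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


x = [0.1, 40.0, -0.3]                        # activation with one outlier channel
W = [[0.2, -0.1], [0.05, 0.03], [-0.4, 0.6]]

act_absmax = [abs(v) for v in x]
weight_absmax = [max(abs(v) for v in row) for row in W]
s = smoothing_scales(act_absmax, weight_absmax)

x_smooth = [xi / si for xi, si in zip(x, s)]                   # X' = X / s
W_smooth = [[w * si for w in row] for row, si in zip(W, s)]    # W' = s * W

print(matvec(x, W), matvec(x_smooth, W_smooth))
print(max(abs(v) for v in x_smooth))
```

The smoothed activations span a much narrower range than the originals, which makes them easier to represent with 8-bit integers, at the cost of a somewhat wider weight range.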
...@@ -116,8 +122,8 @@ model = LLM("./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token") ...@@ -116,8 +122,8 @@ model = LLM("./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token")
To evaluate accuracy, you can use `lm_eval`: To evaluate accuracy, you can use `lm_eval`:
```console ```bash
$ lm_eval --model vllm \ lm_eval --model vllm \
--model_args pretrained="./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token",add_bos_token=true \ --model_args pretrained="./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token",add_bos_token=true \
--tasks gsm8k \ --tasks gsm8k \
--num_fewshot 5 \ --num_fewshot 5 \
......
...@@ -4,7 +4,7 @@ The [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-O ...@@ -4,7 +4,7 @@ The [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-O
We recommend installing the library with: We recommend installing the library with:
```console ```bash
pip install nvidia-modelopt pip install nvidia-modelopt
``` ```
...@@ -14,24 +14,26 @@ You can quantize HuggingFace models using the example scripts provided in the Te ...@@ -14,24 +14,26 @@ You can quantize HuggingFace models using the example scripts provided in the Te
Below is an example showing how to quantize a model using modelopt's PTQ API: Below is an example showing how to quantize a model using modelopt's PTQ API:
```python ??? Code
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM
# Load the model from HuggingFace ```python
model = AutoModelForCausalLM.from_pretrained("<path_or_model_id>") import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM
# Select the quantization config, for example, FP8 # Load the model from HuggingFace
config = mtq.FP8_DEFAULT_CFG model = AutoModelForCausalLM.from_pretrained("<path_or_model_id>")
# Define a forward loop function for calibration # Select the quantization config, for example, FP8
def forward_loop(model): config = mtq.FP8_DEFAULT_CFG
for data in calib_set:
model(data)
# PTQ with in-place replacement of quantized modules # Define a forward loop function for calibration
model = mtq.quantize(model, config, forward_loop) def forward_loop(model):
``` for data in calib_set:
model(data)
# PTQ with in-place replacement of quantized modules
model = mtq.quantize(model, config, forward_loop)
```
After the model is quantized, you can export it to a quantized checkpoint using the export API: After the model is quantized, you can export it to a quantized checkpoint using the export API:
...@@ -48,31 +50,33 @@ with torch.inference_mode(): ...@@ -48,31 +50,33 @@ with torch.inference_mode():
The quantized checkpoint can then be deployed with vLLM. As an example, the following code shows how to deploy `nvidia/Llama-3.1-8B-Instruct-FP8`, which is the FP8 quantized checkpoint derived from `meta-llama/Llama-3.1-8B-Instruct`, using vLLM: The quantized checkpoint can then be deployed with vLLM. As an example, the following code shows how to deploy `nvidia/Llama-3.1-8B-Instruct-FP8`, which is the FP8 quantized checkpoint derived from `meta-llama/Llama-3.1-8B-Instruct`, using vLLM:
```python ??? Code
from vllm import LLM, SamplingParams
def main(): ```python
from vllm import LLM, SamplingParams
model_id = "nvidia/Llama-3.1-8B-Instruct-FP8" def main():
# Ensure you specify quantization='modelopt' when loading the modelopt checkpoint
llm = LLM(model=model_id, quantization="modelopt", trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.8, top_p=0.9) model_id = "nvidia/Llama-3.1-8B-Instruct-FP8"
# Ensure you specify quantization='modelopt' when loading the modelopt checkpoint
llm = LLM(model=model_id, quantization="modelopt", trust_remote_code=True)
prompts = [ sampling_params = SamplingParams(temperature=0.8, top_p=0.9)
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
outputs = llm.generate(prompts, sampling_params) prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
for output in outputs: outputs = llm.generate(prompts, sampling_params)
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
if __name__ == "__main__": for output in outputs:
main() prompt = output.prompt
``` generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
if __name__ == "__main__":
main()
```
...@@ -35,20 +35,22 @@ Studies have shown that FP8 E4M3 quantization typically only minimally degrades ...@@ -35,20 +35,22 @@ Studies have shown that FP8 E4M3 quantization typically only minimally degrades
Here is an example of how to enable FP8 quantization: Here is an example of how to enable FP8 quantization:
```python ??? Code
# To calculate kv cache scales on the fly enable the calculate_kv_scales
# parameter
from vllm import LLM, SamplingParams ```python
# To calculate kv cache scales on the fly enable the calculate_kv_scales
# parameter
sampling_params = SamplingParams(temperature=0.7, top_p=0.8) from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf",
kv_cache_dtype="fp8", sampling_params = SamplingParams(temperature=0.7, top_p=0.8)
calculate_kv_scales=True) llm = LLM(model="meta-llama/Llama-2-7b-chat-hf",
prompt = "London is the capital of" kv_cache_dtype="fp8",
out = llm.generate(prompt, sampling_params)[0].outputs[0].text calculate_kv_scales=True)
print(out) prompt = "London is the capital of"
``` out = llm.generate(prompt, sampling_params)[0].outputs[0].text
print(out)
```
The `kv_cache_dtype` argument specifies the data type for KV cache storage: The `kv_cache_dtype` argument specifies the data type for KV cache storage:
- `"auto"`: Uses the model's default "unquantized" data type - `"auto"`: Uses the model's default "unquantized" data type
...@@ -63,7 +65,7 @@ For optimal model quality when using FP8 KV Cache, we recommend using calibrated ...@@ -63,7 +65,7 @@ For optimal model quality when using FP8 KV Cache, we recommend using calibrated
First, install the required dependencies: First, install the required dependencies:
```console ```bash
pip install llmcompressor pip install llmcompressor
``` ```
...@@ -71,67 +73,69 @@ pip install llmcompressor ...@@ -71,67 +73,69 @@ pip install llmcompressor
Here's a complete example using `meta-llama/Llama-3.1-8B-Instruct` (most models can use this same pattern): Here's a complete example using `meta-llama/Llama-3.1-8B-Instruct` (most models can use this same pattern):
```python ??? Code
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer ```python
from llmcompressor.transformers import oneshot from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
# Select model and load it from llmcompressor.transformers import oneshot
MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto") # Select model and load it
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID) MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
# Select calibration dataset tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft" # Select calibration dataset
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
# Configure calibration parameters DATASET_SPLIT = "train_sft"
NUM_CALIBRATION_SAMPLES = 512 # 512 samples is a good starting point
MAX_SEQUENCE_LENGTH = 2048 # Configure calibration parameters
NUM_CALIBRATION_SAMPLES = 512 # 512 samples is a good starting point
# Load and preprocess dataset MAX_SEQUENCE_LENGTH = 2048
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES)) # Load and preprocess dataset
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
def process_and_tokenize(example): ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
return tokenizer( def process_and_tokenize(example):
text, text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
padding=False, return tokenizer(
max_length=MAX_SEQUENCE_LENGTH, text,
truncation=True, padding=False,
add_special_tokens=False, max_length=MAX_SEQUENCE_LENGTH,
truncation=True,
add_special_tokens=False,
)
ds = ds.map(process_and_tokenize, remove_columns=ds.column_names)
# Configure quantization settings
recipe = """
quant_stage:
quant_modifiers:
QuantizationModifier:
kv_cache_scheme:
num_bits: 8
type: float
strategy: tensor
dynamic: false
symmetric: true
"""
# Apply quantization
oneshot(
model=model,
dataset=ds,
recipe=recipe,
max_seq_length=MAX_SEQUENCE_LENGTH,
num_calibration_samples=NUM_CALIBRATION_SAMPLES,
) )
ds = ds.map(process_and_tokenize, remove_columns=ds.column_names) # Save quantized model: Llama-3.1-8B-Instruct-FP8-KV
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-KV"
# Configure quantization settings model.save_pretrained(SAVE_DIR, save_compressed=True)
recipe = """ tokenizer.save_pretrained(SAVE_DIR)
quant_stage: ```
quant_modifiers:
QuantizationModifier:
kv_cache_scheme:
num_bits: 8
type: float
strategy: tensor
dynamic: false
symmetric: true
"""
# Apply quantization
oneshot(
model=model,
dataset=ds,
recipe=recipe,
max_seq_length=MAX_SEQUENCE_LENGTH,
num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)
# Save quantized model: Llama-3.1-8B-Instruct-FP8-KV
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-KV"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```
The above script will create a folder in your current directory containing your quantized model (e.g., `Llama-3.1-8B-Instruct-FP8-KV`) with calibrated scales. The above script will create a folder in your current directory containing your quantized model (e.g., `Llama-3.1-8B-Instruct-FP8-KV`) with calibrated scales.
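To make "calibrated scales" more concrete, here is a minimal sketch of how a static, symmetric, per-tensor scale could be derived from calibration data, mirroring the `strategy: tensor`, `dynamic: false`, `symmetric: true` settings in the recipe. The function names are hypothetical, the calibration numbers are made up, and the actual observer used by `llmcompressor` may differ; rounding to the FP8 grid is deliberately left out.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def calibrate_scale(batches):
    """Static per-tensor scale from the max magnitude over all calibration batches."""
    amax = max(abs(v) for batch in batches for v in batch)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def scale_into_fp8_range(values, scale):
    """Divide by the scale and clamp to the FP8 range (rounding to FP8 is omitted)."""
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]


# Toy calibration data standing in for observed key/value activations.
calibration_batches = [[0.5, -2.0, 1.0], [7.0, -3.5, 0.25]]
kv_scale = calibrate_scale(calibration_batches)
scaled = scale_into_fp8_range([7.0, -14.0], kv_scale)
```

Because the scale is fixed after calibration, values larger than anything seen during calibration (such as `-14.0` here) are clamped, which is why representative calibration data matters.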
......
--- ---
title: AMD QUARK title: AMD Quark
--- ---
[](){ #quark } [](){ #quark }
...@@ -13,7 +13,7 @@ AWQ, GPTQ, Rotation and SmoothQuant. ...@@ -13,7 +13,7 @@ AWQ, GPTQ, Rotation and SmoothQuant.
Before quantizing models, you need to install Quark. The latest release of Quark can be installed with pip: Before quantizing models, you need to install Quark. The latest release of Quark can be installed with pip:
```console ```bash
pip install amd-quark pip install amd-quark
``` ```
...@@ -22,13 +22,13 @@ for more installation details. ...@@ -22,13 +22,13 @@ for more installation details.
Additionally, install `vllm` and `lm-evaluation-harness` for evaluation: Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:
```console ```bash
pip install vllm lm-eval==0.4.4 pip install vllm lm-eval==0.4.4
``` ```
## Quantization Process ## Quantization Process
After installing Quark, we will use an example to illustrate how to use Quark. After installing Quark, we will use an example to illustrate how to use Quark.
The Quark quantization process consists of the following 5 steps: The Quark quantization process consists of the following 5 steps:
1. Load the model 1. Load the model
...@@ -42,20 +42,22 @@ The Quark quantization process can be listed for 5 steps as below: ...@@ -42,20 +42,22 @@ The Quark quantization process can be listed for 5 steps as below:
Quark uses [Transformers](https://huggingface.co/docs/transformers/en/index) Quark uses [Transformers](https://huggingface.co/docs/transformers/en/index)
to fetch the model and tokenizer. to fetch the model and tokenizer.
```python ??? Code
from transformers import AutoTokenizer, AutoModelForCausalLM
MODEL_ID = "meta-llama/Llama-2-70b-chat-hf" ```python
MAX_SEQ_LEN = 512 from transformers import AutoTokenizer, AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained( MODEL_ID = "meta-llama/Llama-2-70b-chat-hf"
MODEL_ID, device_map="auto", torch_dtype="auto", MAX_SEQ_LEN = 512
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, model_max_length=MAX_SEQ_LEN) model = AutoModelForCausalLM.from_pretrained(
tokenizer.pad_token = tokenizer.eos_token MODEL_ID, device_map="auto", torch_dtype="auto",
``` )
model.eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, model_max_length=MAX_SEQ_LEN)
tokenizer.pad_token = tokenizer.eos_token
```
### 2. Prepare the Calibration Dataloader ### 2. Prepare the Calibration Dataloader
...@@ -63,22 +65,24 @@ Quark uses the [PyTorch Dataloader](https://pytorch.org/tutorials/beginner/basic ...@@ -63,22 +65,24 @@ Quark uses the [PyTorch Dataloader](https://pytorch.org/tutorials/beginner/basic
to load calibration data. For more details about how to use calibration datasets efficiently, please refer to load calibration data. For more details about how to use calibration datasets efficiently, please refer
to [Adding Calibration Datasets](https://quark.docs.amd.com/latest/pytorch/calibration_datasets.html). to [Adding Calibration Datasets](https://quark.docs.amd.com/latest/pytorch/calibration_datasets.html).
```python ??? Code
from datasets import load_dataset
from torch.utils.data import DataLoader
BATCH_SIZE = 1 ```python
NUM_CALIBRATION_DATA = 512 from datasets import load_dataset
from torch.utils.data import DataLoader
# Load the dataset and get calibration data. BATCH_SIZE = 1
dataset = load_dataset("mit-han-lab/pile-val-backup", split="validation") NUM_CALIBRATION_DATA = 512
text_data = dataset["text"][:NUM_CALIBRATION_DATA]
tokenized_outputs = tokenizer(text_data, return_tensors="pt", # Load the dataset and get calibration data.
padding=True, truncation=True, max_length=MAX_SEQ_LEN) dataset = load_dataset("mit-han-lab/pile-val-backup", split="validation")
calib_dataloader = DataLoader(tokenized_outputs['input_ids'], text_data = dataset["text"][:NUM_CALIBRATION_DATA]
batch_size=BATCH_SIZE, drop_last=True)
``` tokenized_outputs = tokenizer(text_data, return_tensors="pt",
padding=True, truncation=True, max_length=MAX_SEQ_LEN)
calib_dataloader = DataLoader(tokenized_outputs['input_ids'],
batch_size=BATCH_SIZE, drop_last=True)
```
### 3. Set the Quantization Configuration ### 3. Set the Quantization Configuration
...@@ -94,42 +98,44 @@ kv-cache and the quantization algorithm is AutoSmoothQuant. ...@@ -94,42 +98,44 @@ kv-cache and the quantization algorithm is AutoSmoothQuant.
The AutoSmoothQuant config file for Llama is The AutoSmoothQuant config file for Llama is
`examples/torch/language_modeling/llm_ptq/models/llama/autosmoothquant_config.json`. `examples/torch/language_modeling/llm_ptq/models/llama/autosmoothquant_config.json`.
```python ??? Code
from quark.torch.quantization import (Config, QuantizationConfig,
FP8E4M3PerTensorSpec, ```python
load_quant_algo_config_from_file) from quark.torch.quantization import (Config, QuantizationConfig,
FP8E4M3PerTensorSpec,
# Define fp8/per-tensor/static spec. load_quant_algo_config_from_file)
FP8_PER_TENSOR_SPEC = FP8E4M3PerTensorSpec(observer_method="min_max",
is_dynamic=False).to_quantization_spec() # Define fp8/per-tensor/static spec.
FP8_PER_TENSOR_SPEC = FP8E4M3PerTensorSpec(observer_method="min_max",
# Define global quantization config, input tensors and weight apply FP8_PER_TENSOR_SPEC. is_dynamic=False).to_quantization_spec()
global_quant_config = QuantizationConfig(input_tensors=FP8_PER_TENSOR_SPEC,
weight=FP8_PER_TENSOR_SPEC) # Define global quantization config, input tensors and weight apply FP8_PER_TENSOR_SPEC.
global_quant_config = QuantizationConfig(input_tensors=FP8_PER_TENSOR_SPEC,
# Define quantization config for kv-cache layers, output tensors apply FP8_PER_TENSOR_SPEC. weight=FP8_PER_TENSOR_SPEC)
KV_CACHE_SPEC = FP8_PER_TENSOR_SPEC
kv_cache_layer_names_for_llama = ["*k_proj", "*v_proj"] # Define quantization config for kv-cache layers, output tensors apply FP8_PER_TENSOR_SPEC.
kv_cache_quant_config = {name : KV_CACHE_SPEC = FP8_PER_TENSOR_SPEC
QuantizationConfig(input_tensors=global_quant_config.input_tensors, kv_cache_layer_names_for_llama = ["*k_proj", "*v_proj"]
weight=global_quant_config.weight, kv_cache_quant_config = {name :
output_tensors=KV_CACHE_SPEC) QuantizationConfig(input_tensors=global_quant_config.input_tensors,
for name in kv_cache_layer_names_for_llama} weight=global_quant_config.weight,
layer_quant_config = kv_cache_quant_config.copy() output_tensors=KV_CACHE_SPEC)
for name in kv_cache_layer_names_for_llama}
# Define algorithm config by config file. layer_quant_config = kv_cache_quant_config.copy()
LLAMA_AUTOSMOOTHQUANT_CONFIG_FILE =
'examples/torch/language_modeling/llm_ptq/models/llama/autosmoothquant_config.json' # Define algorithm config by config file.
algo_config = load_quant_algo_config_from_file(LLAMA_AUTOSMOOTHQUANT_CONFIG_FILE) LLAMA_AUTOSMOOTHQUANT_CONFIG_FILE =
'examples/torch/language_modeling/llm_ptq/models/llama/autosmoothquant_config.json'
EXCLUDE_LAYERS = ["lm_head"] algo_config = load_quant_algo_config_from_file(LLAMA_AUTOSMOOTHQUANT_CONFIG_FILE)
quant_config = Config(
global_quant_config=global_quant_config, EXCLUDE_LAYERS = ["lm_head"]
layer_quant_config=layer_quant_config, quant_config = Config(
kv_cache_quant_config=kv_cache_quant_config, global_quant_config=global_quant_config,
exclude=EXCLUDE_LAYERS, layer_quant_config=layer_quant_config,
algo_config=algo_config) kv_cache_quant_config=kv_cache_quant_config,
``` exclude=EXCLUDE_LAYERS,
algo_config=algo_config)
```
### 4. Quantize the Model and Export ### 4. Quantize the Model and Export
...@@ -139,68 +145,72 @@ HuggingFace `safetensors`, you can refer to ...@@ -139,68 +145,72 @@ HuggingFace `safetensors`, you can refer to
[HuggingFace format exporting](https://quark.docs.amd.com/latest/pytorch/export/quark_export_hf.html) [HuggingFace format exporting](https://quark.docs.amd.com/latest/pytorch/export/quark_export_hf.html)
for more exporting format details. for more exporting format details.
```python ??? Code
import torch
from quark.torch import ModelQuantizer, ModelExporter ```python
from quark.torch.export import ExporterConfig, JsonExporterConfig import torch
from quark.torch import ModelQuantizer, ModelExporter
# Apply quantization. from quark.torch.export import ExporterConfig, JsonExporterConfig
quantizer = ModelQuantizer(quant_config)
quant_model = quantizer.quantize_model(model, calib_dataloader) # Apply quantization.
quantizer = ModelQuantizer(quant_config)
# Freeze quantized model to export. quant_model = quantizer.quantize_model(model, calib_dataloader)
freezed_model = quantizer.freeze(model)
# Freeze quantized model to export.
# Define export config. freezed_model = quantizer.freeze(model)
LLAMA_KV_CACHE_GROUP = ["*k_proj", "*v_proj"]
export_config = ExporterConfig(json_export_config=JsonExporterConfig()) # Define export config.
export_config.json_export_config.kv_cache_group = LLAMA_KV_CACHE_GROUP LLAMA_KV_CACHE_GROUP = ["*k_proj", "*v_proj"]
export_config = ExporterConfig(json_export_config=JsonExporterConfig())
# Model: Llama-2-70b-chat-hf-w-fp8-a-fp8-kvcache-fp8-pertensor-autosmoothquant export_config.json_export_config.kv_cache_group = LLAMA_KV_CACHE_GROUP
EXPORT_DIR = MODEL_ID.split("/")[1] + "-w-fp8-a-fp8-kvcache-fp8-pertensor-autosmoothquant"
exporter = ModelExporter(config=export_config, export_dir=EXPORT_DIR) # Model: Llama-2-70b-chat-hf-w-fp8-a-fp8-kvcache-fp8-pertensor-autosmoothquant
with torch.no_grad(): EXPORT_DIR = MODEL_ID.split("/")[1] + "-w-fp8-a-fp8-kvcache-fp8-pertensor-autosmoothquant"
exporter.export_safetensors_model(freezed_model, exporter = ModelExporter(config=export_config, export_dir=EXPORT_DIR)
quant_config=quant_config, tokenizer=tokenizer) with torch.no_grad():
``` exporter.export_safetensors_model(freezed_model,
quant_config=quant_config, tokenizer=tokenizer)
```
### 5. Evaluation in vLLM ### 5. Evaluation in vLLM
Now, you can load and run the Quark quantized model directly through the LLM entrypoint: Now, you can load and run the Quark quantized model directly through the LLM entrypoint:
```python ??? Code
from vllm import LLM, SamplingParams
```python
# Sample prompts. from vllm import LLM, SamplingParams
prompts = [
"Hello, my name is", # Sample prompts.
"The president of the United States is", prompts = [
"The capital of France is", "Hello, my name is",
"The future of AI is", "The president of the United States is",
] "The capital of France is",
# Create a sampling params object. "The future of AI is",
sampling_params = SamplingParams(temperature=0.8, top_p=0.95) ]
# Create a sampling params object.
# Create an LLM. sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="Llama-2-70b-chat-hf-w-fp8-a-fp8-kvcache-fp8-pertensor-autosmoothquant",
kv_cache_dtype='fp8',quantization='quark') # Create an LLM.
# Generate texts from the prompts. The output is a list of RequestOutput objects llm = LLM(model="Llama-2-70b-chat-hf-w-fp8-a-fp8-kvcache-fp8-pertensor-autosmoothquant",
# that contain the prompt, generated text, and other information. kv_cache_dtype='fp8',quantization='quark')
outputs = llm.generate(prompts, sampling_params) # Generate texts from the prompts. The output is a list of RequestOutput objects
# Print the outputs. # that contain the prompt, generated text, and other information.
print("\nGenerated Outputs:\n" + "-" * 60) outputs = llm.generate(prompts, sampling_params)
for output in outputs: # Print the outputs.
prompt = output.prompt print("\nGenerated Outputs:\n" + "-" * 60)
generated_text = output.outputs[0].text for output in outputs:
print(f"Prompt: {prompt!r}") prompt = output.prompt
print(f"Output: {generated_text!r}") generated_text = output.outputs[0].text
print("-" * 60) print(f"Prompt: {prompt!r}")
``` print(f"Output: {generated_text!r}")
print("-" * 60)
```
Or, you can use `lm_eval` to evaluate accuracy: Or, you can use `lm_eval` to evaluate accuracy:
```console ```bash
$ lm_eval --model vllm \ lm_eval --model vllm \
--model_args pretrained=Llama-2-70b-chat-hf-w-fp8-a-fp8-kvcache-fp8-pertensor-autosmoothquant,kv_cache_dtype='fp8',quantization='quark' \ --model_args pretrained=Llama-2-70b-chat-hf-w-fp8-a-fp8-kvcache-fp8-pertensor-autosmoothquant,kv_cache_dtype='fp8',quantization='quark' \
--tasks gsm8k --tasks gsm8k
``` ```
...@@ -212,7 +222,7 @@ to quantize large language models more conveniently. It supports quantizing mode ...@@ -212,7 +222,7 @@ to quantize large language models more conveniently. It supports quantizing mode
of different quantization schemes and optimization algorithms. It can export the quantized model of different quantization schemes and optimization algorithms. It can export the quantized model
and run evaluation tasks on the fly. With the script, the example above can be: and run evaluation tasks on the fly. With the script, the example above can be:
```console ```bash
python3 quantize_quark.py --model_dir meta-llama/Llama-2-70b-chat-hf \ python3 quantize_quark.py --model_dir meta-llama/Llama-2-70b-chat-hf \
--output_dir /path/to/output \ --output_dir /path/to/output \
--quant_scheme w_fp8_a_fp8 \ --quant_scheme w_fp8_a_fp8 \
......
...@@ -4,7 +4,7 @@ TorchAO is an architecture optimization library for PyTorch, it provides high pe ...@@ -4,7 +4,7 @@ TorchAO is an architecture optimization library for PyTorch, it provides high pe
We recommend installing the latest torchao nightly with We recommend installing the latest torchao nightly with
```console ```bash
# Install the latest TorchAO nightly build # Install the latest TorchAO nightly build
# Choose the CUDA version that matches your system (cu126, cu128, etc.) # Choose the CUDA version that matches your system (cu126, cu128, etc.)
pip install \ pip install \
...@@ -15,26 +15,28 @@ pip install \ ...@@ -15,26 +15,28 @@ pip install \
## Quantizing HuggingFace Models ## Quantizing HuggingFace Models
You can quantize your own HuggingFace model with torchao, e.g. [transformers](https://huggingface.co/docs/transformers/main/en/quantization/torchao) and [diffusers](https://huggingface.co/docs/diffusers/en/quantization/torchao), and save the checkpoint to the HuggingFace Hub like [this](https://huggingface.co/jerryzh168/llama3-8b-int8wo) with the following example code: You can quantize your own HuggingFace model with torchao, e.g. [transformers](https://huggingface.co/docs/transformers/main/en/quantization/torchao) and [diffusers](https://huggingface.co/docs/diffusers/en/quantization/torchao), and save the checkpoint to the HuggingFace Hub like [this](https://huggingface.co/jerryzh168/llama3-8b-int8wo) with the following example code:
```Python ??? Code
import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer ```Python
from torchao.quantization import Int8WeightOnlyConfig import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
model_name = "meta-llama/Meta-Llama-3-8B" from torchao.quantization import Int8WeightOnlyConfig
quantization_config = TorchAoConfig(Int8WeightOnlyConfig())
quantized_model = AutoModelForCausalLM.from_pretrained( model_name = "meta-llama/Meta-Llama-3-8B"
model_name, quantization_config = TorchAoConfig(Int8WeightOnlyConfig())
torch_dtype="auto", quantized_model = AutoModelForCausalLM.from_pretrained(
device_map="auto", model_name,
quantization_config=quantization_config torch_dtype="auto",
) device_map="auto",
tokenizer = AutoTokenizer.from_pretrained(model_name) quantization_config=quantization_config
input_text = "What are we having for dinner?" )
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda") tokenizer = AutoTokenizer.from_pretrained(model_name)
input_text = "What are we having for dinner?"
hub_repo = # YOUR HUB REPO ID input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
tokenizer.push_to_hub(hub_repo)
quantized_model.push_to_hub(hub_repo, safe_serialization=False) hub_repo = # YOUR HUB REPO ID
``` tokenizer.push_to_hub(hub_repo)
quantized_model.push_to_hub(hub_repo, safe_serialization=False)
```
Alternatively, you can use the [TorchAO Quantization space](https://huggingface.co/spaces/medmekk/TorchAO_Quantization) for quantizing models with a simple UI. Alternatively, you can use the [TorchAO Quantization space](https://huggingface.co/spaces/medmekk/TorchAO_Quantization) for quantizing models with a simple UI.
...@@ -33,34 +33,36 @@ vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \ ...@@ -33,34 +33,36 @@ vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
Next, make a request to the model that should return the reasoning content in the response. Next, make a request to the model that should return the reasoning content in the response.
```python ??? Code
from openai import OpenAI
# Modify OpenAI's API key and API base to use vLLM's API server. ```python
openai_api_key = "EMPTY" from openai import OpenAI
openai_api_base = "http://localhost:8000/v1"
client = OpenAI( # Modify OpenAI's API key and API base to use vLLM's API server.
api_key=openai_api_key, openai_api_key = "EMPTY"
base_url=openai_api_base, openai_api_base = "http://localhost:8000/v1"
)
models = client.models.list() client = OpenAI(
model = models.data[0].id api_key=openai_api_key,
base_url=openai_api_base,
)
# Round 1 models = client.models.list()
messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}] model = models.data[0].id
# For granite, add: `extra_body={"chat_template_kwargs": {"thinking": True}}`
# For Qwen3 series, if you want to disable thinking in reasoning mode, add:
# extra_body={"chat_template_kwargs": {"enable_thinking": False}}
response = client.chat.completions.create(model=model, messages=messages)
reasoning_content = response.choices[0].message.reasoning_content # Round 1
content = response.choices[0].message.content messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]
# For granite, add: `extra_body={"chat_template_kwargs": {"thinking": True}}`
# For Qwen3 series, if you want to disable thinking in reasoning mode, add:
# extra_body={"chat_template_kwargs": {"enable_thinking": False}}
response = client.chat.completions.create(model=model, messages=messages)
print("reasoning_content:", reasoning_content) reasoning_content = response.choices[0].message.reasoning_content
print("content:", content) content = response.choices[0].message.content
```
print("reasoning_content:", reasoning_content)
print("content:", content)
```
The `reasoning_content` field contains the reasoning steps that led to the final conclusion, while the `content` field contains the final conclusion. The `reasoning_content` field contains the reasoning steps that led to the final conclusion, while the `content` field contains the final conclusion.
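The split between the two fields can be pictured with a small sketch. It assumes the model wraps its reasoning in `<think>...</think>` tags, as DeepSeek-R1-style models do; `split_reasoning` is a hypothetical helper for illustration, not vLLM's parser, which is selected on the server with `--reasoning-parser` and may handle more cases.

```python
def split_reasoning(text, start_tag="<think>", end_tag="</think>"):
    """Return (reasoning_content, content) for text that may wrap reasoning in tags."""
    end = text.find(end_tag)
    if end == -1:
        # No closing tag: treat everything as the final answer.
        return None, text
    reasoning = text[:end]
    if reasoning.startswith(start_tag):
        reasoning = reasoning[len(start_tag):]
    content = text[end + len(end_tag):]
    return reasoning.strip(), content.strip()


raw = "<think>9.11 has a smaller tenths digit than 9.8.</think>9.8 is greater."
reasoning_content, content = split_reasoning(raw)
```

With a reasoning parser enabled, this separation happens on the server, so the client only reads the two fields.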
...@@ -68,164 +70,125 @@ The `reasoning_content` field contains the reasoning steps that led to the final ...@@ -68,164 +70,125 @@ The `reasoning_content` field contains the reasoning steps that led to the final
Streaming chat completions are also supported for reasoning models. The `reasoning_content` field is available in the `delta` field in [chat completion response chunks](https://platform.openai.com/docs/api-reference/chat/streaming). Streaming chat completions are also supported for reasoning models. The `reasoning_content` field is available in the `delta` field in [chat completion response chunks](https://platform.openai.com/docs/api-reference/chat/streaming).
```json ??? Json
{
"id": "chatcmpl-123", ```json
"object": "chat.completion.chunk", {
"created": 1694268190, "id": "chatcmpl-123",
"model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", "object": "chat.completion.chunk",
"system_fingerprint": "fp_44709d6fcb", "created": 1694268190,
"choices": [ "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
{ "system_fingerprint": "fp_44709d6fcb",
"index": 0, "choices": [
"delta": { {
"role": "assistant", "index": 0,
"reasoning_content": "is", "delta": {
}, "role": "assistant",
"logprobs": null, "reasoning_content": "is",
"finish_reason": null },
} "logprobs": null,
] "finish_reason": null
} }
``` ]
}
```
The OpenAI Python client library does not officially support the `reasoning_content` attribute for streaming output, but the client does support extra attributes in the response. You can use `hasattr` to check whether the `reasoning_content` attribute is present. For example: The OpenAI Python client library does not officially support the `reasoning_content` attribute for streaming output, but the client does support extra attributes in the response. You can use `hasattr` to check whether the `reasoning_content` attribute is present. For example:
```python ??? Code
from openai import OpenAI
```python
# Modify OpenAI's API key and API base to use vLLM's API server. from openai import OpenAI
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1" # Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
client = OpenAI( openai_api_base = "http://localhost:8000/v1"
api_key=openai_api_key,
base_url=openai_api_base, client = OpenAI(
) api_key=openai_api_key,
base_url=openai_api_base,
models = client.models.list() )
model = models.data[0].id
models = client.models.list()
messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}] model = models.data[0].id
# For granite, add: `extra_body={"chat_template_kwargs": {"thinking": True}}`
# For Qwen3 series, if you want to disable thinking in reasoning mode, add: messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]
# extra_body={"chat_template_kwargs": {"enable_thinking": False}} # For granite, add: `extra_body={"chat_template_kwargs": {"thinking": True}}`
stream = client.chat.completions.create(model=model, # For Qwen3 series, if you want to disable thinking in reasoning mode, add:
messages=messages, # extra_body={"chat_template_kwargs": {"enable_thinking": False}}
stream=True) stream = client.chat.completions.create(model=model,
messages=messages,
print("client: Start streaming chat completions...") stream=True)
printed_reasoning_content = False
printed_content = False print("client: Start streaming chat completions...")
printed_reasoning_content = False
for chunk in stream: printed_content = False
reasoning_content = None
content = None for chunk in stream:
# Check the content is reasoning_content or content reasoning_content = None
if hasattr(chunk.choices[0].delta, "reasoning_content"): content = None
reasoning_content = chunk.choices[0].delta.reasoning_content # Check the content is reasoning_content or content
elif hasattr(chunk.choices[0].delta, "content"): if hasattr(chunk.choices[0].delta, "reasoning_content"):
content = chunk.choices[0].delta.content reasoning_content = chunk.choices[0].delta.reasoning_content
elif hasattr(chunk.choices[0].delta, "content"):
if reasoning_content is not None: content = chunk.choices[0].delta.content
if not printed_reasoning_content:
printed_reasoning_content = True if reasoning_content is not None:
print("reasoning_content:", end="", flush=True) if not printed_reasoning_content:
print(reasoning_content, end="", flush=True) printed_reasoning_content = True
elif content is not None: print("reasoning_content:", end="", flush=True)
if not printed_content: print(reasoning_content, end="", flush=True)
printed_content = True elif content is not None:
print("\ncontent:", end="", flush=True) if not printed_content:
# Extract and print the content printed_content = True
print(content, end="", flush=True) print("\ncontent:", end="", flush=True)
``` # Extract and print the content
print(content, end="", flush=True)
```
Remember to check whether `reasoning_content` exists in the response before accessing it. You can check out the [example](https://github.com/vllm-project/vllm/blob/main/examples/online_serving/openai_chat_completion_with_reasoning_streaming.py).

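
One way to keep that check in a single place is a small helper built on `getattr` with a default. The sketch below is only an illustration: `split_delta` is a hypothetical helper name, and `SimpleNamespace` objects stand in for the streamed `delta` objects so that the snippet runs without a server.

```python
from types import SimpleNamespace


def split_delta(delta):
    """Return (reasoning_content, content) from a streamed delta.

    Either element is None when the attribute is missing or unset.
    """
    reasoning = getattr(delta, "reasoning_content", None)
    content = getattr(delta, "content", None)
    return reasoning, content


# Stand-ins for streamed deltas; real ones come from `chunk.choices[0].delta`.
deltas = [
    SimpleNamespace(reasoning_content="Compare 9.11 and 9.8.", content=None),
    SimpleNamespace(content="9.8 is greater."),  # no reasoning_content attribute
]

for delta in deltas:
    reasoning, content = split_delta(delta)
    if reasoning is not None:
        print("reasoning_content:", reasoning)
    if content is not None:
        print("content:", content)
```

Unlike the `hasattr`/`elif` chain, this reads both attributes independently, so a delta that carries both fields would not lose its `content`.
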
## Structured output
The reasoning content is also available in the structured output. A structured output engine such as `xgrammar` uses the reasoning content to generate structured output. This is currently supported only in the v0 engine.
```bash
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --reasoning-parser deepseek_r1
```
The following is an example client:
```python
from openai import OpenAI
from pydantic import BaseModel

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


class People(BaseModel):
    name: str
    age: int


json_schema = People.model_json_schema()

prompt = ("Generate a JSON with the name and age of one random person.")
completion = client.chat.completions.create(
    model=model,
    messages=[{
        "role": "user",
        "content": prompt,
    }],
    extra_body={"guided_json": json_schema},
)
print("reasoning_content: ", completion.choices[0].message.reasoning_content)
print("content: ", completion.choices[0].message.content)
```

## Tool Calling

The reasoning content is also available when both tool calling and the reasoning parser are enabled. Additionally, tool calling only parses functions from the `content` field, not from the `reasoning_content`.

??? Code

    ```python
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City and state, e.g., 'San Francisco, CA'"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["location", "unit"]
            }
        }
    }]

    response = client.chat.completions.create(
        model=client.models.list().data[0].id,
        messages=[{"role": "user", "content": "What's the weather like in San Francisco?"}],
        tools=tools,
        tool_choice="auto"
    )

    print(response)
    tool_call = response.choices[0].message.tool_calls[0].function

    print(f"reasoning_content: {response.choices[0].message.reasoning_content}")
    print(f"Function called: {tool_call.name}")
    print(f"Arguments: {tool_call.arguments}")
    ```

For more examples, please refer to <gh-file:examples/online_serving/openai_chat_completion_tool_calls_with_reasoning.py>.

...@@ -237,85 +200,89 @@ For more examples, please refer to <gh-file:examples/online_serving/openai_chat_ ...@@ -237,85 +200,89 @@ For more examples, please refer to <gh-file:examples/online_serving/openai_chat_
You can add a new `ReasoningParser` similar to <gh-file:vllm/reasoning/deepseek_r1_reasoning_parser.py>.

??? Code

    ```python
    # import the required packages

    from vllm.reasoning import ReasoningParser, ReasoningParserManager
    from vllm.entrypoints.openai.protocol import (ChatCompletionRequest,
                                                  DeltaMessage)

    # define a reasoning parser and register it to vllm
    # the name list in register_module can be used
    # in --reasoning-parser.
    @ReasoningParserManager.register_module(["example"])
    class ExampleParser(ReasoningParser):
        def __init__(self, tokenizer: AnyTokenizer):
            super().__init__(tokenizer)

        def extract_reasoning_content_streaming(
            self,
            previous_text: str,
            current_text: str,
            delta_text: str,
            previous_token_ids: Sequence[int],
            current_token_ids: Sequence[int],
            delta_token_ids: Sequence[int],
        ) -> Union[DeltaMessage, None]:
            """
            Instance method that should be implemented for extracting reasoning
            from an incomplete response; for use when handling reasoning calls and
            streaming. Has to be an instance method because it requires state -
            the current tokens/diffs, but also the information about what has
            previously been parsed and extracted (see constructor)
            """

        def extract_reasoning_content(
                self, model_output: str, request: ChatCompletionRequest
        ) -> tuple[Optional[str], Optional[str]]:
            """
            Extract reasoning content from a complete model-generated string.

            Used for non-streaming responses where we have the entire model response
            available before sending to the client.

            Parameters:
            model_output: str
                The model-generated string to extract reasoning content from.

            request: ChatCompletionRequest
                The request object that was used to generate the model_output.

            Returns:
            tuple[Optional[str], Optional[str]]
                A tuple containing the reasoning content and the content.
            """
    ```

Additionally, to enable structured output, you'll need to create a new `Reasoner` similar to the one in <gh-file:vllm/reasoning/deepseek_r1_reasoning_parser.py>.

??? Code

    ```python
    @dataclass
    class DeepSeekReasoner(Reasoner):
        """
        Reasoner for DeepSeek R series models.
        """
        start_token_id: int
        end_token_id: int

        start_token: str = "<think>"
        end_token: str = "</think>"

        @classmethod
        def from_tokenizer(cls, tokenizer: PreTrainedTokenizer) -> Reasoner:
            return cls(start_token_id=tokenizer.encode(
                "<think>", add_special_tokens=False)[0],
                       end_token_id=tokenizer.encode("</think>",
                                                     add_special_tokens=False)[0])

        def is_reasoning_end(self, input_ids: list[int]) -> bool:
            return self.end_token_id in input_ids
        ...
    ```

A structured output engine such as [xgrammar](https://github.com/mlc-ai/xgrammar) uses `end_token_id` to check whether reasoning content is present in the model output, and skips the structured output if that is the case.

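
The check described here can be pictured with plain token-id lists. This is a minimal sketch under stated assumptions, not vLLM's or xgrammar's actual control flow: `should_constrain` is a hypothetical helper, and the token ids are made up for illustration.

```python
def is_reasoning_end(input_ids: list[int], end_token_id: int) -> bool:
    """True once the end-of-reasoning token has been generated."""
    return end_token_id in input_ids


def should_constrain(input_ids: list[int], end_token_id: int) -> bool:
    """Assumed policy: apply the structured-output constraint only after reasoning ended."""
    return is_reasoning_end(input_ids, end_token_id)


END_TOKEN_ID = 2  # hypothetical id for "</think>"

still_thinking = [1, 10, 11]        # "<think>" followed by reasoning tokens
done_thinking = [1, 10, 11, 2, 20]  # reasoning closed, answer has started

print(should_constrain(still_thinking, END_TOKEN_ID))  # False
print(should_constrain(done_thinking, END_TOKEN_ID))   # True
```
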
......
...@@ -18,29 +18,31 @@ Speculative decoding is a technique which improves inter-token latency in memory ...@@ -18,29 +18,31 @@ Speculative decoding is a technique which improves inter-token latency in memory
The following code configures vLLM in an offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time.

??? Code

    ```python
    from vllm import LLM, SamplingParams

    prompts = [
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(
        model="facebook/opt-6.7b",
        tensor_parallel_size=1,
        speculative_config={
            "model": "facebook/opt-125m",
            "num_speculative_tokens": 5,
        },
    )
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
    ```

To do the same in online mode, launch the server:

...@@ -60,69 +62,73 @@ python -m vllm.entrypoints.openai.api_server \ ...@@ -60,69 +62,73 @@ python -m vllm.entrypoints.openai.api_server \
Then use a client:

??? Code

    ```python
    from openai import OpenAI

    # Modify OpenAI's API key and API base to use vLLM's API server.
    openai_api_key = "EMPTY"
    openai_api_base = "http://localhost:8000/v1"

    client = OpenAI(
        # defaults to os.environ.get("OPENAI_API_KEY")
        api_key=openai_api_key,
        base_url=openai_api_base,
    )

    models = client.models.list()
    model = models.data[0].id

    # Completion API
    stream = False
    completion = client.completions.create(
        model=model,
        prompt="The future of AI is",
        echo=False,
        n=1,
        stream=stream,
    )

    print("Completion results:")
    if stream:
        for c in completion:
            print(c)
    else:
        print(completion)
    ```

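
To build intuition for what the draft model does, here is a toy propose-and-verify loop. It is a sketch under strong assumptions, not vLLM's implementation: both "models" are hypothetical deterministic functions over integer tokens, decoding is greedy, and a drafted token is accepted only on an exact match (real engines that sample use a probabilistic acceptance rule and score all drafted positions in one forward pass).

```python
def target_next(tokens: list[int]) -> int:
    """Toy 'large' model: the next token is a fixed function of the context."""
    return (sum(tokens) * 31 + len(tokens)) % 50


def draft_next(tokens: list[int]) -> int:
    """Toy 'small' model: agrees with the target except on some contexts."""
    guess = target_next(tokens)
    return guess if len(tokens) % 3 else (guess + 1) % 50


def greedy_decode(prompt: list[int], max_new: int) -> list[int]:
    """Reference: plain greedy decoding with the target model only."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(prompt: list[int], max_new: int, k: int) -> list[int]:
    """Draft k tokens, keep the longest prefix the target agrees with."""
    tokens = list(prompt)
    target_len = len(prompt) + max_new
    while len(tokens) < target_len:
        # 1. The draft model proposes k tokens autoregressively.
        proposal: list[int] = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2. The target model checks each drafted position in order.
        for i in range(k):
            expected = target_next(tokens + proposal[:i])
            if proposal[i] != expected:
                # First mismatch: keep the agreed prefix plus the target's token.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            # Every drafted token was accepted; the target adds one more.
            tokens.extend(proposal)
            tokens.append(target_next(tokens))
    return tokens[:target_len]


prompt = [1, 2, 3]
print(speculative_decode(prompt, 12, k=5) == greedy_decode(prompt, 12))  # True
```

Every appended token is one the target model would have chosen itself, so in this greedy setting the draft model changes only how many target steps are needed, not the output.
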
## Speculating by matching n-grams in the prompt

The following code configures vLLM to use speculative decoding where proposals are generated by
matching n-grams in the prompt. For more information read [this thread](https://x.com/joao_gante/status/1747322413006643259).

??? Code

    ```python
    from vllm import LLM, SamplingParams

    prompts = [
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(
        model="facebook/opt-6.7b",
        tensor_parallel_size=1,
        speculative_config={
            "method": "ngram",
            "num_speculative_tokens": 5,
            "prompt_lookup_max": 4,
        },
    )
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
    ```

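
The idea behind n-gram proposals can be sketched in a few lines of plain Python. This is a simplified illustration, not vLLM's proposer: `propose_from_prompt` is a hypothetical function, and treating `max_ngram` as the counterpart of `prompt_lookup_max` is an assumption about that option's meaning.

```python
def propose_from_prompt(token_ids: list[int], max_ngram: int,
                        num_tokens: int) -> list[int]:
    """Propose tokens by finding an earlier occurrence of the current suffix.

    Tries the longest suffix first and returns the tokens that followed
    its most recent earlier occurrence, or [] when nothing matches.
    """
    for n in range(min(max_ngram, len(token_ids) - 1), 0, -1):
        suffix = token_ids[-n:]
        # Scan earlier start positions, most recent first.
        for start in range(len(token_ids) - n - 1, -1, -1):
            if token_ids[start:start + n] == suffix:
                return token_ids[start + n:start + n + num_tokens]
    return []


# The suffix [5, 6, 7] appeared at the start, followed by 8 and 9.
print(propose_from_prompt([5, 6, 7, 8, 9, 5, 6, 7], max_ngram=4, num_tokens=2))  # [8, 9]
# No repeated n-gram, so there is nothing to propose.
print(propose_from_prompt([1, 2, 3, 4], max_ngram=4, num_tokens=2))  # []
```

Proposals found this way cost no extra model call, which is why this method needs no draft model in the configuration above.
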
## Speculating using MLP speculators

...@@ -131,29 +137,31 @@ draft models that conditioning draft predictions on both context vectors and sam ...@@ -131,29 +137,31 @@ draft models that conditioning draft predictions on both context vectors and sam
For more information see [this blog](https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/) or
[this technical report](https://arxiv.org/abs/2404.19124).

??? Code

    ```python
    from vllm import LLM, SamplingParams

    prompts = [
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(
        model="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tensor_parallel_size=4,
        speculative_config={
            "model": "ibm-ai-platform/llama3-70b-accelerator",
            "draft_tensor_parallel_size": 1,
        },
    )
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
    ```

Note that these speculative models currently need to be run without tensor parallelism, although
it is possible to run the main model using tensor parallelism (see example above). Since the
...@@ -177,31 +185,34 @@ A variety of speculative models of this type are available on HF hub: ...@@ -177,31 +185,34 @@ A variety of speculative models of this type are available on HF hub:
The following code configures vLLM to use speculative decoding where proposals are generated by
an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found [here](gh-file:examples/offline_inference/eagle.py).

??? Code

    ```python
    from vllm import LLM, SamplingParams

    prompts = [
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        tensor_parallel_size=4,
        speculative_config={
            "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",
            "draft_tensor_parallel_size": 1,
            "num_speculative_tokens": 2,
        },
    )

    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
    ```

A few important things to consider when using the EAGLE based draft models:
......
...@@ -21,7 +21,7 @@ The following parameters are supported, which must be added as extra parameters: ...@@ -21,7 +21,7 @@ The following parameters are supported, which must be added as extra parameters:
- `guided_grammar`: the output will follow the context free grammar.
- `structural_tag`: Follow a JSON schema within a set of specified tags within the generated text.

You can see the complete list of supported parameters on the [OpenAI-Compatible Server][serving-openai-compatible-server] page.

Structured outputs are supported by default in the OpenAI-Compatible Server. You
may choose to specify the backend to use by setting the
...@@ -33,38 +33,43 @@ text. ...@@ -33,38 +33,43 @@ text.
Now let's see an example for each of the cases, starting with the `guided_choice`, as it's the easiest one:

??? Code

    ```python
    from openai import OpenAI
    client = OpenAI(
        base_url="http://localhost:8000/v1",
        api_key="-",
    )
    model = client.models.list().data[0].id

    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
        ],
        extra_body={"guided_choice": ["positive", "negative"]},
    )
    print(completion.choices[0].message.content)
    ```

The next example shows how to use the `guided_regex`. The idea is to generate an email address, given a simple regex template:

??? Code

    ```python
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": "Generate an example email address for Alan Turing, who works in Enigma. End in .com and new line. Example result: alan.turing@enigma.com\n",
            }
        ],
        extra_body={"guided_regex": r"\w+@\w+\.com\n", "stop": ["\n"]},
    )
    print(completion.choices[0].message.content)
    ```

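
Before sending a pattern to the server, it can help to try it locally against the strings you expect. The sketch below uses Python's `re` module as a stand-in, on the assumption that the server-side regex backend treats this simple pattern the same way; `matches` is a hypothetical helper name.

```python
import re

EMAIL_PATTERN = r"\w+@\w+\.com\n"  # same pattern as in the request above


def matches(text: str) -> bool:
    """Return True when the whole string is accepted by the pattern."""
    return re.fullmatch(EMAIL_PATTERN, text) is not None


print(matches("alan@enigma.com\n"))         # True
print(matches("alan.turing@enigma.com\n"))  # False: "\w" does not match "."
print(matches("alan@enigma.com"))           # False: the trailing newline is required
```

Note that the example result given in the prompt, `alan.turing@enigma.com`, contains a dot in the local part and therefore cannot be produced under this pattern; extend the pattern (for instance with `[\w.]+`) if dotted names are wanted.
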
One of the most relevant features in structured text generation is the option to generate a valid JSON with pre-defined fields and formats.
For this we can use the `guided_json` parameter in two different ways:
...@@ -74,75 +79,128 @@ For this we can use the `guided_json` parameter in two different ways: ...@@ -74,75 +79,128 @@ For this we can use the `guided_json` parameter in two different ways:
The next example shows how to use the `guided_json` parameter with a Pydantic model:

??? Code

    ```python
    from pydantic import BaseModel
    from enum import Enum

    class CarType(str, Enum):
        sedan = "sedan"
        suv = "SUV"
        truck = "Truck"
        coupe = "Coupe"

    class CarDescription(BaseModel):
        brand: str
        model: str
        car_type: CarType

    json_schema = CarDescription.model_json_schema()

    completion = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": "Generate a JSON with the brand, model and car_type of the most iconic car from the 90's",
            }
        ],
        # Previous form of this example, using the vLLM extra parameter:
        # extra_body={"guided_json": json_schema},
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "car-description",
                "schema": CarDescription.model_json_schema()
            },
        },
    )
    print(completion.choices[0].message.content)
    ```

!!! tip
    While not strictly necessary, normally it's better to indicate in the prompt the
    JSON schema and how the fields should be populated. This can improve the
    results notably in most cases.

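
One way to follow this tip is to append the schema to the user prompt. The snippet below is a sketch only: `build_prompt` is a hypothetical helper, the wording of the instruction is an assumption, and `car_schema` is a hand-written stand-in for `CarDescription.model_json_schema()` so that the example runs without Pydantic.

```python
import json

# Hand-written stand-in for `CarDescription.model_json_schema()`.
car_schema = {
    "type": "object",
    "properties": {
        "brand": {"type": "string"},
        "model": {"type": "string"},
        "car_type": {"type": "string", "enum": ["sedan", "SUV", "Truck", "Coupe"]},
    },
    "required": ["brand", "model", "car_type"],
}


def build_prompt(task: str, schema: dict) -> str:
    """Append the JSON schema to the task so the model sees the expected fields."""
    return (
        f"{task}\n\n"
        "Answer with a single JSON object that follows this JSON schema:\n"
        f"{json.dumps(schema, indent=2)}"
    )


prompt = build_prompt(
    "Generate a JSON with the brand, model and car_type of the most iconic car from the 90's",
    car_schema,
)
print(prompt)
```

The resulting string can be used as the `content` of the user message, alongside the schema constraint passed in the request.
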
Finally we have the `guided_grammar` option, which is probably the most
difficult to use, but it's really powerful. It allows us to define complete
languages like SQL queries. It works by using a context free EBNF grammar.
As an example, we can use it to define a specific format of simplified SQL queries:

??? Code

    ```python
    simplified_sql_grammar = """
        root ::= select_statement

        select_statement ::= "SELECT " column " from " table " where " condition

        column ::= "col_1 " | "col_2 "

        table ::= "table_1 " | "table_2 "

        condition ::= column "= " number

        number ::= "1 " | "2 "
    """

    completion = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": "Generate an SQL query to show the 'username' and 'email' from the 'users' table.",
            }
        ],
        extra_body={"guided_grammar": simplified_sql_grammar},
    )
    print(completion.choices[0].message.content)
    ```

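
Because this grammar has no recursion, the full set of strings it accepts can be enumerated by hand. The sketch below does so with `itertools.product`; it is an illustration of what the grammar allows, written independently of any grammar engine, and the helper name `sentences` is hypothetical.

```python
from itertools import product

# Terminal alternatives copied from the grammar rules.
columns = ["col_1 ", "col_2 "]
tables = ["table_1 ", "table_2 "]
numbers = ["1 ", "2 "]


def sentences() -> list[str]:
    """Expand select_statement into every string the grammar can produce."""
    out = []
    for column, table, cond_column, number in product(columns, tables,
                                                      columns, numbers):
        condition = cond_column + "= " + number
        out.append("SELECT " + column + " from " + table + " where " + condition)
    return out


all_sentences = sentences()
print(len(all_sentences))      # 16
print(repr(all_sentences[0]))  # 'SELECT col_1  from table_1  where col_1 = 1 '
```

Whatever the prompt asks for, the constrained output can only be one of these 16 strings, so names such as `username` or `users` from the prompt cannot appear in it. The double spaces come from the terminals themselves, which already end in a space.
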
See also: [full example](https://docs.vllm.ai/en/latest/examples/online_serving/structured_outputs.html)

## Reasoning Outputs

You can also use structured outputs with <project:#reasoning-outputs> for reasoning models.

```bash
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --reasoning-parser deepseek_r1
```

Note that you can use reasoning with any provided structured outputs feature. The following uses one with JSON schema:

??? Code

    ```python
    from pydantic import BaseModel


    class People(BaseModel):
        name: str
        age: int


    completion = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": "Generate a JSON with the name and age of one random person.",
            }
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "people",
                "schema": People.model_json_schema()
            }
        },
    )
    print("reasoning_content: ", completion.choices[0].message.reasoning_content)
    print("content: ", completion.choices[0].message.content)
    ```

See also: [full example](https://docs.vllm.ai/en/latest/examples/online_serving/structured_outputs.html)
## Experimental Automatic Parsing (OpenAI API)

...@@ -154,33 +212,33 @@ For the following examples, vLLM was setup using `vllm serve meta-llama/Llama-3. ...@@ -154,33 +212,33 @@ For the following examples, vLLM was setup using `vllm serve meta-llama/Llama-3.
Here is a simple example demonstrating how to get structured output using Pydantic models:

??? Code

    ```python
    from pydantic import BaseModel
    from openai import OpenAI

    class Info(BaseModel):
        name: str
        age: int

    client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")
    model = client.models.list().data[0].id
    completion = client.beta.chat.completions.parse(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "My name is Cameron, I'm 28. What's my name and age?"},
        ],
        response_format=Info,
    )

    message = completion.choices[0].message
    print(message)
    assert message.parsed
    print("Name:", message.parsed.name)
    print("Age:", message.parsed.age)
    ```

Output:

```console
ParsedChatCompletionMessage[Testing](content='{"name": "Cameron", "age": 28}', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[], parsed=Testing(name='Cameron', age=28))
```
...@@ -190,37 +248,37 @@ Age: 28 ...@@ -190,37 +248,37 @@ Age: 28
Here is a more complex example using nested Pydantic models to handle a step-by-step math solution:

??? Code

    ```python
    from typing import List
    from pydantic import BaseModel
    from openai import OpenAI

    class Step(BaseModel):
        explanation: str
        output: str

    class MathResponse(BaseModel):
        steps: list[Step]
        final_answer: str

    completion = client.beta.chat.completions.parse(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful expert math tutor."},
            {"role": "user", "content": "Solve 8x + 31 = 2."},
        ],
        response_format=MathResponse,
    )

    message = completion.choices[0].message
    print(message)
    assert message.parsed
    for i, step in enumerate(message.parsed.steps):
        print(f"Step #{i}:", step)
    print("Answer:", message.parsed.final_answer)
    ```

Output: Output:
...@@ -232,11 +290,11 @@ Step #2: explanation="Next, let's isolate 'x' by dividing both sides of the equa ...@@ -232,11 +290,11 @@ Step #2: explanation="Next, let's isolate 'x' by dividing both sides of the equa
Answer: x = -29/8 Answer: x = -29/8
``` ```
An example of using `structural_tag` can be found here: <gh-file:examples/online_serving/structured_outputs>

## Offline Inference

Offline inference allows for the same types of structured outputs.
To use it, we'll need to configure the guided decoding using the class `GuidedDecodingParams` inside `SamplingParams`.
The main available options inside `GuidedDecodingParams` are:
...@@ -247,22 +305,24 @@ The main available options inside `GuidedDecodingParams` are: ...@@ -247,22 +305,24 @@ The main available options inside `GuidedDecodingParams` are:
- `structural_tag`

These parameters can be used in the same way as the parameters from the Online
Serving examples above. One example for the usage of the `choice` parameter is
shown below:

??? Code

    ```python
    from vllm import LLM, SamplingParams
    from vllm.sampling_params import GuidedDecodingParams

    llm = LLM(model="HuggingFaceTB/SmolLM2-1.7B-Instruct")

    guided_decoding_params = GuidedDecodingParams(choice=["Positive", "Negative"])
    sampling_params = SamplingParams(guided_decoding=guided_decoding_params)
    outputs = llm.generate(
        prompts="Classify this sentiment: vLLM is wonderful!",
        sampling_params=sampling_params,
    )
    print(outputs[0].outputs[0].text)
    ```

Full example: <gh-file:examples/offline_inference/structured_outputs.py> See also: [full example](https://docs.vllm.ai/en/latest/examples/online_serving/structured_outputs.html)
...@@ -15,44 +15,46 @@ vllm serve meta-llama/Llama-3.1-8B-Instruct \ ...@@ -15,44 +15,46 @@ vllm serve meta-llama/Llama-3.1-8B-Instruct \
Next, make a request to the model that should result in it using the available tools:

??? Code

    ```python
    from openai import OpenAI
    import json

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

    def get_weather(location: str, unit: str):
        return f"Getting the weather for {location} in {unit}..."
    tool_functions = {"get_weather": get_weather}

    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City and state, e.g., 'San Francisco, CA'"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["location", "unit"]
            }
        }
    }]

    response = client.chat.completions.create(
        model=client.models.list().data[0].id,
        messages=[{"role": "user", "content": "What's the weather like in San Francisco?"}],
        tools=tools,
        tool_choice="auto"
    )

    tool_call = response.choices[0].message.tool_calls[0].function
    print(f"Function called: {tool_call.name}")
    print(f"Arguments: {tool_call.arguments}")
    print(f"Result: {tool_functions[tool_call.name](**json.loads(tool_call.arguments))}")
    ```

Example output:
...@@ -97,6 +99,14 @@ vLLM supports the `tool_choice='required'` option in the chat completion API. Si ...@@ -97,6 +99,14 @@ vLLM supports the `tool_choice='required'` option in the chat completion API. Si
When `tool_choice='required'` is set, the model is guaranteed to generate one or more tool calls based on the specified tool list in the `tools` parameter. The number of tool calls depends on the user's query. The output format strictly follows the schema defined in the `tools` parameter.
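
Because a response may then contain several tool calls, client code usually needs to loop over all of them rather than read only the first. The following is a minimal sketch of such a loop, assuming the same `get_weather` helper as in the quickstart example. `dispatch_tool_calls` and `fake_tool_calls` are hypothetical names introduced for illustration, and the fake objects only mimic the `function.name` / `function.arguments` attributes of the tool calls returned by the OpenAI client so that the snippet runs without a server.

```python
import json
from types import SimpleNamespace


def get_weather(location: str, unit: str) -> str:
    return f"Getting the weather for {location} in {unit}..."


# Registry mapping tool names to local Python callables.
tool_functions = {"get_weather": get_weather}


def dispatch_tool_calls(tool_calls):
    """Run every tool call in order and return (name, result) pairs.

    Hypothetical helper for illustration; not part of vLLM.
    """
    results = []
    for call in tool_calls:
        fn = tool_functions[call.function.name]
        kwargs = json.loads(call.function.arguments)
        results.append((call.function.name, fn(**kwargs)))
    return results


# Stand-in for response.choices[0].message.tool_calls (assumed shape).
fake_tool_calls = [
    SimpleNamespace(function=SimpleNamespace(
        name="get_weather",
        arguments='{"location": "San Francisco, CA", "unit": "celsius"}')),
    SimpleNamespace(function=SimpleNamespace(
        name="get_weather",
        arguments='{"location": "Austin, TX", "unit": "fahrenheit"}')),
]

for name, result in dispatch_tool_calls(fake_tool_calls):
    print(f"{name}: {result}")
```

With a live server, the list passed to `dispatch_tool_calls` would be `response.choices[0].message.tool_calls` instead of the stand-in objects.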
## None Function Calling
vLLM supports the `tool_choice='none'` option in the chat completion API. When this option is set, the model will not generate any tool calls and will respond with regular text content only, even if tools are defined in the request.
By default, when `tool_choice='none'` is specified, vLLM excludes tool definitions from the prompt to optimize context usage. To include tool definitions even with `tool_choice='none'`, use the `--expand-tools-even-if-tool-choice-none` option.
Note: This behavior will change in v0.10.0, where tool definitions will be included by default even with `tool_choice='none'`.
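
A request that sets `tool_choice='none'` could be assembled as sketched below. The helper name `build_chat_request` is hypothetical, and the model name is only an assumption borrowed from the serving command; the resulting dictionary is meant to be unpacked into `client.chat.completions.create(**request)` against a running vLLM server.

```python
def build_chat_request(model: str, user_text: str, tools: list,
                       tool_choice: str = "none") -> dict:
    """Assemble keyword arguments for client.chat.completions.create().

    Hypothetical helper for illustration; not part of vLLM or the OpenAI SDK.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "tools": tools,
        "tool_choice": tool_choice,
    }


# Tool definition in the same shape as the quickstart example.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location", "unit"],
        },
    },
}]

request = build_chat_request(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model name
    user_text="What's the weather like in San Francisco?",
    tools=tools,
)
print(request["tool_choice"])
```

Given the behavior described above, the reply to such a request should carry regular text in `message.content` and no tool calls, even though `tools` is populated.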
## Automatic Function Calling

To enable this feature, you should set the following flags:
...@@ -226,6 +236,25 @@ AI21's Jamba-1.5 models are supported. ...@@ -226,6 +236,25 @@ AI21's Jamba-1.5 models are supported.
Flags: `--tool-call-parser jamba`
### xLAM Models (`xlam`)
The xLAM tool parser is designed to support models that generate tool calls in various JSON formats. It detects function calls in several different output styles:
1. Direct JSON arrays: Output strings that are JSON arrays starting with `[` and ending with `]`
2. Thinking tags: Using `<think>...</think>` tags containing JSON arrays
3. Code blocks: JSON in code blocks (```json ...```)
4. Tool calls tags: Using `[TOOL_CALLS]` or `<tool_call>...</tool_call>` tags
Parallel function calls are supported, and the parser can effectively separate text content from tool calls.
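
To make the four output styles concrete, here is a simplified sketch of how such detection could work. It is not vLLM's xLAM parser: the function names, the regular expressions, and the order in which the styles are tried are all assumptions made for illustration.

```python
import json
import re
from typing import Optional

FENCE = "`" * 3  # built programmatically to keep this snippet fence-safe

# Patterns for the wrapped styles (2-4); illustrative only.
_PATTERNS = [
    re.compile(r"<think>(.*?)</think>", re.DOTALL),
    re.compile(re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE), re.DOTALL),
    re.compile(r"\[TOOL_CALLS\](.*)", re.DOTALL),
    re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL),
]


def _parse_call_list(candidate: str) -> Optional[list]:
    """Return the parsed list if candidate is a JSON array of objects."""
    candidate = candidate.strip()
    if not (candidate.startswith("[") and candidate.endswith("]")):
        return None
    try:
        parsed = json.loads(candidate)
    except json.JSONDecodeError:
        return None
    if isinstance(parsed, list) and all(isinstance(c, dict) for c in parsed):
        return parsed
    return None


def extract_calls(output: str) -> tuple[Optional[str], list]:
    """Split model output into (text content, tool calls)."""
    # Style 1: the whole output is a JSON array.
    calls = _parse_call_list(output)
    if calls is not None:
        return None, calls
    # Styles 2-4: a JSON array wrapped in tags or a code fence.
    for pattern in _PATTERNS:
        match = pattern.search(output)
        if match:
            calls = _parse_call_list(match.group(1))
            if calls is not None:
                content = (output[:match.start()] + output[match.end():]).strip()
                return (content or None), calls
    # No tool calls found: everything is plain text.
    return output, []


content, calls = extract_calls(
    'Let me check. <tool_call>[{"name": "get_weather", "arguments": {}}]</tool_call>'
)
print(content, calls)
```

Returning the remaining text alongside the parsed calls mirrors the separation of text content from tool calls mentioned above; several entries in one array correspond to parallel function calls.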
Supported models:
* Salesforce Llama-xLAM models: `Salesforce/Llama-xLAM-2-8B-fc-r`, `Salesforce/Llama-xLAM-2-70B-fc-r`
* Qwen-xLAM models: `Salesforce/xLAM-1B-fc-r`, `Salesforce/xLAM-3B-fc-r`, `Salesforce/Qwen-xLAM-32B-fc-r`
Flags:
* For Llama-based xLAM models: `--tool-call-parser xlam --chat-template examples/tool_chat_template_xlam_llama.jinja`
* For Qwen-based xLAM models: `--tool-call-parser xlam --chat-template examples/tool_chat_template_xlam_qwen.jinja`
### Qwen Models

For Qwen2.5, the chat template in `tokenizer_config.json` already includes support for Hermes-style tool use, so you can use the `hermes` parser to enable tool calls for Qwen models. For more detailed information, please refer to the official [Qwen documentation](https://qwen.readthedocs.io/en/latest/framework/function_call.html#vllm).
...@@ -235,6 +264,15 @@ For Qwen2.5, the chat template in tokenizer_config.json has already included sup ...@@ -235,6 +264,15 @@ For Qwen2.5, the chat template in tokenizer_config.json has already included sup
Flags: `--tool-call-parser hermes`
### MiniMax Models (`minimax_m1`)
Supported models:
* `MiniMaxAi/MiniMax-M1-40k` (use with <gh-file:examples/tool_chat_template_minimax.jinja>)
* `MiniMaxAi/MiniMax-M1-80k` (use with <gh-file:examples/tool_chat_template_minimax.jinja>)
Flags: `--tool-call-parser minimax --chat-template examples/tool_chat_template_minimax.jinja`
### DeepSeek-V3 Models (`deepseek_v3`)

Supported models:
...@@ -282,53 +320,55 @@ A tool parser plugin is a Python file containing one or more ToolParser implemen ...@@ -282,53 +320,55 @@ A tool parser plugin is a Python file containing one or more ToolParser implemen
Here is a summary of a plugin file:

??? Code

    ```python
    # import the required packages

    # define a tool parser and register it to vllm
    # the name list in register_module can be used
    # in --tool-call-parser. you can define as many
    # tool parsers as you want here.
    @ToolParserManager.register_module(["example"])
    class ExampleToolParser(ToolParser):
        def __init__(self, tokenizer: AnyTokenizer):
            super().__init__(tokenizer)

        # adjust request. e.g.: set skip special tokens
        # to False for tool call output.
        def adjust_request(
                self, request: ChatCompletionRequest) -> ChatCompletionRequest:
            return request

        # implement the tool call parse for stream call
        def extract_tool_calls_streaming(
            self,
            previous_text: str,
            current_text: str,
            delta_text: str,
            previous_token_ids: Sequence[int],
            current_token_ids: Sequence[int],
            delta_token_ids: Sequence[int],
            request: ChatCompletionRequest,
        ) -> Union[DeltaMessage, None]:
            return delta

        # implement the tool parse for non-stream call
        def extract_tool_calls(
            self,
            model_output: str,
            request: ChatCompletionRequest,
        ) -> ExtractedToolCallInformation:
            return ExtractedToolCallInformation(tools_called=False,
                                                tool_calls=[],
                                                content=text)
    ```

Then you can use this plugin in the command line like this.
```console ```bash
--enable-auto-tool-choice \ --enable-auto-tool-choice \
--tool-parser-plugin <absolute path of the plugin file> --tool-parser-plugin <absolute path of the plugin file>
--tool-call-parser example \ --tool-call-parser example \
......
...@@ -2,4 +2,6 @@ nav: ...@@ -2,4 +2,6 @@ nav:
- README.md
- gpu.md
- cpu.md
- google_tpu.md
- intel_gaudi.md
- aws_neuron.md
...@@ -14,7 +14,6 @@ vLLM supports the following hardware platforms: ...@@ -14,7 +14,6 @@ vLLM supports the following hardware platforms:
- [ARM AArch64](cpu.md#arm-aarch64)
- [Apple silicon](cpu.md#apple-silicon)
- [IBM Z (S390X)](cpu.md#ibm-z-s390x)
- [Google TPU](google_tpu.md)
- [Intel Gaudi](intel_gaudi.md)
- [AWS Neuron](aws_neuron.md)
# Other AI accelerators
vLLM is a Python library that supports the following AI accelerators. Select your AI accelerator type to see vendor specific instructions:
=== "Google TPU"
--8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:installation"
=== "Intel Gaudi"
--8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:installation"
=== "AWS Neuron"
--8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:installation"
## Requirements
=== "Google TPU"
--8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:requirements"
=== "Intel Gaudi"
--8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:requirements"
=== "AWS Neuron"
--8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:requirements"
## Configure a new environment
=== "Google TPU"
--8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:configure-a-new-environment"
=== "Intel Gaudi"
--8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:configure-a-new-environment"
=== "AWS Neuron"
--8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:configure-a-new-environment"
## Set up using Python
### Pre-built wheels
=== "Google TPU"
--8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:pre-built-wheels"
=== "Intel Gaudi"
--8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:pre-built-wheels"
=== "AWS Neuron"
--8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:pre-built-wheels"
### Build wheel from source
=== "Google TPU"
--8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:build-wheel-from-source"
=== "Intel Gaudi"
--8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:build-wheel-from-source"
=== "AWS Neuron"
--8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:build-wheel-from-source"
## Set up using Docker
### Pre-built images
=== "Google TPU"
--8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:pre-built-images"
=== "Intel Gaudi"
--8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:pre-built-images"
=== "AWS Neuron"
--8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:pre-built-images"
### Build image from source
=== "Google TPU"
--8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:build-image-from-source"
=== "Intel Gaudi"
--8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:build-image-from-source"
=== "AWS Neuron"
--8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:build-image-from-source"
## Extra information
=== "Google TPU"
--8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:extra-information"
=== "Intel Gaudi"
--8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:extra-information"
=== "AWS Neuron"
--8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:extra-information"