A model repository in OpenLLM represents a catalog of available LLMs. You can add your own repository to OpenLLM with custom Qwen2.5 variants for your specific needs. See our `documentation <https://github.com/bentoml/OpenLLM?tab=readme-ov-file#model-repository>`_ for details.
{"role": "user", "content": "Give me a short introduction to large language models."}
],
"temperature": 0.6,
"top_p": 0.95,
"top_k": 20,
"max_tokens": 32768
}'
```
:::
:::{tab-item} Python
You can use the API client with the `openai` Python SDK as shown below:
```python
from openai import OpenAI

# Set OpenAI's API key and API base to use SGLang's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:30000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language models."},
    ],
    temperature=0.6,
    top_p=0.95,
    max_tokens=32768,
    # top_k is not a named parameter of the OpenAI SDK; pass it via extra_body.
    extra_body={"top_k": 20},
)
print("Chat response:", chat_response)
```
::::
:::{tip}
While the default sampling parameters work most of the time for thinking mode,
it is recommended to adjust them according to your application,
and to always pass the sampling parameters to the API explicitly.
:::
### Thinking & Non-Thinking Modes
:::{important}
This feature has not been released.
For more information, please see this [pull request](https://github.com/sgl-project/sglang/pull/5551).
:::
Qwen3 models think before responding.
This behaviour can be controlled either by the hard switch, which disables thinking completely, or by the soft switch, where the model follows the user's instruction on whether it should think.
The hard switch is available in SGLang through the following configuration in the API call.
The context length of Qwen3 models in pretraining is up to 32,768 tokens.
To handle context length substantially exceeding 32,768 tokens, RoPE scaling techniques should be applied.
We have validated the performance of [YaRN](https://arxiv.org/abs/2309.00071), a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts.
SGLang implements static YaRN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts.**
We advise adding the `rope_scaling` configuration only when processing long contexts is required.
It is also recommended to modify the `factor` as needed. For example, if the typical context length for your application is 65,536 tokens, it would be better to set `factor` as 2.0.
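If the server is launched with `sglang.launch_server`, the scaling can be passed as a model override, for example (a sketch; adjust `factor` and `--context-length` to your workload):
```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-8B --port 30000 \
  --json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}}' \
  --context-length 131072
```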
:::
:::{note}
The default `max_position_embeddings` in `config.json` is set to 40,960, which is used by SGLang.
This allocation includes reserving 32,768 tokens for outputs and 8,192 tokens for typical prompts, which is sufficient for most scenarios involving short text processing and leaves adequate room for model thinking.
If the average context length does not exceed 32,768 tokens, we do not recommend enabling YaRN in this scenario, as it may potentially degrade model performance.
After that, you need to verify cloud access with a command like:
.. code:: bash
sky check
For more information, check the `official document <https://skypilot.readthedocs.io/en/latest/getting-started/installation.html>`__ and see if you have
set up your cloud accounts correctly.
Alternatively, you can also use the official docker image with SkyPilot
master branch automatically cloned by running:
.. code:: bash
# NOTE: '--platform linux/amd64' is needed for Apple Silicon Macs
Hugging Face's Text Generation Inference (TGI) is a production-ready framework specifically designed for deploying and serving large language models (LLMs) for text generation tasks. It offers a seamless deployment experience, powered by a robust set of features.
The easiest way to use TGI is via the TGI docker image. In this guide, we show how to use TGI with docker.
It's possible to run it locally via Conda or build locally. Please refer to `Installation Guide <https://huggingface.co/docs/text-generation-inference/installation>`_ and `CLI tool <https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/using_cli>`_ for detailed instructions.
Deploy Qwen2.5 with TGI
-----------------------
1. **Find a Qwen2.5 Model:** Choose a model from `the Qwen2.5 collection <https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e>`_.
2. **Deployment Command:** Run the following command in your terminal, replacing ``model`` with your chosen Qwen2.5 model ID and ``volume`` with the path to your local data directory:
.. code:: bash
model=Qwen/Qwen2.5-7B-Instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model
Using TGI API
-------------
Once deployed, the model will be available on the mapped port (8080).
TGI comes with a handy API for streaming responses:

.. code:: bash

   curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "",
     "messages": [
       {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
       {"role": "user", "content": "Tell me something about large language models."}
     ],
     "stream": true,
     "temperature": 0.7,
     "top_p": 0.8,
     "repetition_penalty": 1.05,
     "max_tokens": 512
   }'
.. note::
The ``model`` field in the JSON is not used by TGI; you can put anything there.
Refer to the `TGI Swagger UI <https://huggingface.github.io/text-generation-inference/#/Text%20Generation%20Inference/completions>`_ for a complete API reference.
You can also use Python API:
.. code:: python
from openai import OpenAI
# initialize the client but point it to TGI
client = OpenAI(
base_url="http://localhost:8080/v1/", # replace with your endpoint url
api_key="", # this field is not used when running locally
)
chat_completion = client.chat.completions.create(
model="", # it is not used by TGI, you can put anything
messages=[
{"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
{"role": "user", "content": "Tell me something about large language models."},
],
stream=True,
temperature=0.7,
top_p=0.8,
max_tokens=512,
)
# iterate and print stream
for message in chat_completion:
print(message.choices[0].delta.content, end="")
Quantization for Performance
----------------------------
1. Data-dependent quantization (GPTQ and AWQ)
Both GPTQ and AWQ models are data-dependent. The official quantized models can be found from `the Qwen2.5 collection`_ and you can also quantize models with your own dataset to make it perform better on your use case.
The following shows the command to start TGI with Qwen2.5-7B-Instruct-GPTQ-Int4:
.. code:: bash
model=Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model --quantize gptq
If the model is quantized with AWQ, e.g. Qwen/Qwen2.5-7B-Instruct-AWQ, please use ``--quantize awq``.
2. Data-agnostic quantization
EETQ, on the other hand, is not data-dependent and can be used with any model. Note that we're passing in the original model (instead of a quantized model) together with the ``--quantize eetq`` flag.
.. code:: bash
model=Qwen/Qwen2.5-7B-Instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model --quantize eetq
Multi-Accelerators Deployment
-----------------------------
Use the ``--num-shard`` flag to specify the number of accelerators. Please also use ``--shm-size 1g`` to enable shared memory for optimal NCCL performance (`reference <https://github.com/huggingface/text-generation-inference?tab=readme-ov-file#a-note-on-shared-memory-shm>`__):
.. code:: bash
model=Qwen/Qwen2.5-7B-Instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model --num-shard 2
Speculative Decoding
--------------------
Speculative decoding can reduce the time per token by speculating on the next tokens. Use the ``--speculate`` flag, setting the value to the number of tokens to speculate on (default: 0, i.e., no speculation):
.. code:: bash
model=Qwen/Qwen2.5-7B-Instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model --speculate 2
The overall performance of speculative decoding highly depends on the type of task. It works best for code or highly repetitive text.
More context on speculative decoding can be found `here <https://huggingface.co/docs/text-generation-inference/conceptual/speculation>`__.
Qwen2.5 supports long context lengths, so carefully choose the values for ``--max-batch-prefill-tokens``, ``--max-total-tokens``, and ``--max-input-tokens`` to avoid potential out-of-memory (OOM) issues. If an OOM occurs, you'll receive an error message upon startup. The following shows an example to modify those parameters:
.. code:: bash
model=Qwen/Qwen2.5-7B-Instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model --max-batch-prefill-tokens 4096 --max-total-tokens 4096 --max-input-tokens 2048
We recommend trying [vLLM](https://github.com/vllm-project/vllm) for your deployment of Qwen.
It is simple to use, and it is fast with state-of-the-art serving throughput, efficient management of attention key value memory with PagedAttention, continuous batching of input requests, optimized CUDA kernels, etc.
To learn more about vLLM, please refer to the [paper](https://arxiv.org/abs/2309.06180) and [documentation](https://docs.vllm.ai/).
## Environment Setup
By default, you can install `vllm` with pip in a clean environment:
```shell
pip install"vllm>=0.8.4"
```
Please note that the prebuilt `vllm` has strict dependencies on `torch` and its CUDA versions.
Check the note in the official document for installation ([link](https://docs.vllm.ai/en/latest/getting_started/installation.html)) for more help.
## API Service
It is easy to build an OpenAI-compatible API service with vLLM, which can be deployed as a server that implements OpenAI API protocol.
By default, it starts the server at `http://localhost:8000`.
You can specify the address with `--host` and `--port` arguments.
Run the command as shown below:
```shell
vllm serve Qwen/Qwen3-8B
```
By default, if the model argument does not point to a valid local directory, vLLM downloads the model files from the Hugging Face Hub.
To download models from ModelScope instead, set the following environment variable before running the above command:
```shell
export VLLM_USE_MODELSCOPE=true
```
For distributed inference with tensor parallelism, it is as simple as:
```shell
vllm serve Qwen/Qwen3-8B --tensor-parallel-size 4
```
The above command will use tensor parallelism on 4 GPUs.
You should change the number of GPUs according to your demand.
### Basic Usage
Then, you can use the [create chat interface](https://platform.openai.com/docs/api-reference/chat/completions/create) to communicate with Qwen:
::::{tab-set}
:::{tab-item} curl
You can use curl to interact with the API as shown below:
```shell
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-8B",
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models."}
],
"temperature": 0.6,
"top_p": 0.95,
"top_k": 20,
"max_tokens": 32768
}'
```
:::
:::{tab-item} Python
You can use the API client with the `openai` Python SDK as shown below:
```python
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language models."},
    ],
    temperature=0.6,
    top_p=0.95,
    max_tokens=32768,
    # top_k is not a named parameter of the OpenAI SDK; pass it via extra_body.
    extra_body={"top_k": 20},
)
print("Chat response:", chat_response)
```
::::
:::{tip}
`vllm` will use the sampling parameters from the `generation_config.json` in the model files.
While the default sampling parameters work most of the time for thinking mode,
it is recommended to adjust them according to your application,
and to always pass the sampling parameters to the API explicitly.
:::
### Thinking & Non-Thinking Modes
Qwen3 models think before responding.
This behaviour can be controlled either by the hard switch, which disables thinking completely, or by the soft switch, where the model follows the user's instruction on whether it should think.
The hard switch is available in vLLM through the following configuration in the API call.
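For example, with the OpenAI-compatible server, the request can carry extra chat template arguments. The snippet below is a sketch; it assumes the chat template exposes an `enable_thinking` flag (as Qwen3 templates do) and that the server forwards `chat_template_kwargs`:
```python
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.7,
    top_p=0.8,
    max_tokens=8192,
    # Hard switch: instruct the chat template not to open a thinking block at all.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(chat_response.choices[0].message.content)
```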
For more information, please refer to [our guide on Function Calling](../framework/function_call.md#vllm).
:::{note}
As of vLLM 0.5.4, parsing the thinking content and the tool calls from the model generation at the same time is not supported.
:::
### Structured/JSON Output
vLLM supports structured/JSON output.
Please refer to [vLLM's documentation](https://docs.vllm.ai/en/stable/serving/openai_compatible_server.html#extra-parameters-for-chat-api) for the `guided_json` parameters.
Besides, it is also recommended to instruct the model to generate the specific format in the system message or in your prompt.
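As a sketch of the request-level usage (the schema and prompt below are illustrative, not from the original guide), the JSON Schema can be passed via `extra_body`:
```python
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Illustrative JSON Schema; also describe the expected format in the prompt itself.
json_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "population": {"type": "integer"}},
    "required": ["name", "population"],
}

completion = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "Give the name and population of the capital of France as JSON."}],
    extra_body={"guided_json": json_schema},
)
print(completion.choices[0].message.content)
```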
### Serving Quantized models
Qwen3 comes with two types of pre-quantized models, FP8 and AWQ.
The commands for serving these models are the same as for the original models, except for the name change:
```shell
# For the FP8 quantized model
vllm serve Qwen/Qwen3-8B-FP8
# For the AWQ quantized model
vllm serve Qwen/Qwen3-8B-AWQ
```
:::{note}
FP8 computation is supported on NVIDIA GPUs with compute capability >= 8.9, that is, Ada Lovelace, Hopper, and later GPUs.
On GPUs with compute capability >= 8.0 (Ampere), FP8 models will instead run as weight-only W8A16, utilizing the FP8 Marlin kernel.
:::
:::{important}
As of vLLM 0.5.4, there are compatibility issues between `vllm` and the Qwen3 FP8 checkpoints.
For a quick fix, you can make the following changes to the file `vllm/vllm/model_executor/layers/linear.py`:
```python
# these changes are in QKVParallelLinear.weight_loader_v2() of vllm/vllm/model_executor/layers/linear.py
```
The context length of Qwen3 models in pretraining is up to 32,768 tokens.
To handle context length substantially exceeding 32,768 tokens, RoPE scaling techniques should be applied.
We have validated the performance of [YaRN](https://arxiv.org/abs/2309.00071), a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts.
vLLM implements static YaRN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts.**
We advise adding the `rope_scaling` configuration only when processing long contexts is required.
It is also recommended to modify the `factor` as needed. For example, if the typical context length for your application is 65,536 tokens, it would be better to set `factor` as 2.0.
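For example, a 131,072-token window with `factor` 4.0 might be configured at startup as follows (a sketch; adjust `factor` and `--max-model-len` to your workload):
```shell
vllm serve Qwen/Qwen3-8B \
  --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
  --max-model-len 131072
```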
:::
:::{note}
The default `max_position_embeddings` in `config.json` is set to 40,960, which is used by vLLM if `--max-model-len` is not specified.
This allocation includes reserving 32,768 tokens for outputs and 8,192 tokens for typical prompts, which is sufficient for most scenarios involving short text processing and leaves adequate room for model thinking.
If the average context length does not exceed 32,768 tokens, we do not recommend enabling YaRN in this scenario, as it may potentially degrade model performance.
:::
## Python Library
vLLM can also be used directly as a Python library, which is convenient for offline batch inference but lacks some API-only features, such as parsing the model generation into structured messages.
The following shows the basic usage of vLLM as a library:
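A minimal sketch of offline inference (the model name and sampling values mirror the serving examples above):
```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-8B")
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, max_tokens=32768)

messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)
```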
You may encounter out-of-memory (OOM) issues. Two arguments can help:
- The first one is `--max-model-len`.
  Our provided default `max_position_embeddings` is `40960`, so the maximum serving length is also this value, which leads to higher memory requirements.
  Reducing it to a length appropriate for your use case often helps with OOM issues.
- The other argument worth attention is `--gpu-memory-utilization`.
  vLLM pre-allocates this fraction of GPU memory; by default, it is `0.9`.
  This is also why a vLLM service always appears to take up so much memory.
  If you are in eager mode (not the default), you can increase it to tackle the OOM problem.
  Otherwise, CUDA Graphs are used, which consume GPU memory not controlled by vLLM, and you should try lowering it.
  If that doesn't work, try `--enforce-eager`, which may slow down inference, or reduce `--max-model-len`.
Based on the above known information, respond to the user's question concisely and professionally. If an answer cannot be derived from it, say 'The question cannot be answered with the given information' or 'Not enough relevant information has been provided,' and do not include fabricated details in the answer. Please respond in English. The question is {question}"""
To connect Qwen2.5 with external data, such as documents, web pages, etc., we offer a tutorial on `LlamaIndex <https://www.llamaindex.ai/>`__.
This guide helps you quickly implement retrieval-augmented generation (RAG) using LlamaIndex with Qwen2.5.
Preparation
--------------------------------------
To implement RAG,
we advise you to install the LlamaIndex-related packages first.
The following is a simple code snippet showing how to do this:
.. code:: bash
pip install llama-index
pip install llama-index-llms-huggingface
pip install llama-index-readers-web
Set Parameters
--------------------------------------
Now we can set up LLM, embedding model, and the related configurations.
Qwen2.5-Instruct supports conversations in multiple languages, including English and Chinese.
You can use the ``bge-base-en-v1.5`` model to retrieve from English documents, and you can download the ``bge-base-zh-v1.5`` model to retrieve from Chinese documents.
You can also choose ``bge-large`` or ``bge-small`` as the embedding model or modify the context window size or text chunk size depending on your computing resources.
Qwen2.5 model families support a maximum context window of 32K tokens (up to 128K for the 7B, 14B, 32B, and 72B models, which requires extra configuration).
.. code:: python
import torch
from llama_index.core import Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
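# Configure the global Settings (a minimal sketch; the model names, context window,
# and chunk size below are assumptions, so adjust them to your resources).
Settings.llm = HuggingFaceLLM(
    model_name="Qwen/Qwen2.5-7B-Instruct",
    tokenizer_name="Qwen/Qwen2.5-7B-Instruct",
    context_window=30000,
    max_new_tokens=2000,
    generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
    model_kwargs={"torch_dtype": torch.float16},
    device_map="auto",
)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")
Settings.transformations = [SentenceSplitter(chunk_size=1024)]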
Now we can build index from documents or websites.
The following code snippet demonstrates how to build an index for files (regardless of whether they are in PDF or TXT format) in a local folder named 'document'.
.. code:: python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
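# Build the index from the local folder named "document" (a minimal sketch;
# replace the path with your own data directory).
documents = SimpleDirectoryReader("./document").load_data()
index = VectorStoreIndex.from_documents(documents)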
Since the support for tool calling in Qwen3 is a superset of that in Qwen2, the examples would still work.
:::
## Preface
Function calling with large language models is a huge and evolving topic.
It is particularly important for AI applications:
- either for AI-native applications that strive to work around the shortcomings of current AI technology,
- or for existing applications that seek to integrate AI technology to improve performance, user interaction and experience, or efficiency.
This guide will not delve into those discussions, or into which role an LLM should play in an application and the related best practices.
Those views are reflected in the design of AI application frameworks: from LangChain to LlamaIndex to Qwen-Agent.
Instead, we will talk about how Qwen2.5 can be used to support function calling and how it can be used to achieve your goals, from the inference usage for developing applications to the inner workings for hardcore customizations.
In this guide,
- We will first demonstrate how to use function calling with Qwen2.5.
- Then, we will introduce the technical details of function calling with Qwen2.5, which are mainly about the templates.
Before starting, there is one thing we have not yet introduced, that is ...
## What is function calling?
:::{Note}
There is another term "tool use" that may be used to refer to the same concept.
While some may argue that tools are a generalized form of functions, at present, their difference exists only technically as different I/O types of programming interfaces.
:::
Large language models (LLMs) are powerful things.
However, sometimes LLMs by themselves are simply not capable enough.
- On the one hand, LLMs have inherent modeling limitations.
For one, they do not know things that are not in their training data, which includes anything that happened after their training ended.
In addition, they learn in terms of likelihood, which suggests that they may not be precise enough for tasks with fixed rule sets, e.g., mathematical computation.
- On the other hand, it is not easy to use LLMs programmatically as a plug-and-play service together with other components.
LLMs mostly talk in words that are open to interpretation and thus ambiguous, while other software, applications, or systems talk in code and through programming interfaces that are pre-defined, fixed, and structured.
To this end, function calling establishes a common protocol that specifies how LLMs should interact with the other things.
The procedure is mainly as follows:
1. The application provides a set of functions and the instructions of the functions to an LLM.
2. The LLM chooses whether to use one or more of the functions (or may be forced to use them), in response to user queries.
3. If the LLM chooses to use the functions, it states how the functions should be used based on the function instructions.
4. The chosen functions are used as such by the application and the results are obtained, which are then given to the LLM if further interaction is needed.
There are many ways for LLMs to understand and follow this protocol.
As always, the key is prompt engineering or an internalized template known by the model.
Qwen2.5 models were pre-trained with various types of templates that support function calling, so that users can directly make use of this procedure.
## Inference with Function Calling
:::{note}
Please be aware that the inference usage is subject to change as the frameworks and the Qwen models evolve.
:::
As function calling is essentially implemented via prompt engineering, you could manually construct the model inputs for Qwen2.5 models.
However, frameworks with function calling support can help you with all that laborious work.
In the following, we will introduce the usage (via a dedicated function calling chat template) with
- **Qwen-Agent**,
- **Hugging Face transformers**,
- **Ollama**, and
- **vLLM**.
If you are familiar with the usage of OpenAI API, you could also directly use the OpenAI-compatible API services for Qwen2.5.
However, not all of them support function calling for Qwen2.5.
Currently, supported solutions include the self-hosted service by [Ollama](https://github.com/ollama/ollama/blob/main/docs/openai.md) or [vLLM](https://docs.vllm.ai/en/stable/serving/openai_compatible_server.html#tool-calling-in-the-chat-completion-api) and the cloud service of [ModelStudio \[zh\]](https://help.aliyun.com/zh/model-studio/developer-reference/compatibility-of-openai-with-dashscope#97e2b45391x08).
If you are familiar with application frameworks, e.g., LangChain, you can also use function calling abilities in Qwen2.5 via ReAct Prompting.
### The Example Case
Let's also use an example to demonstrate the inference usage.
We assume **Python 3.11** is used as the programming language.
**Scenario**: Suppose we would like to ask the model about the temperature of a location.
Normally, the model would reply that it cannot provide real-time information.
But we have two tools that can be used to obtain the current temperature of a city and the temperature of a city at a given date, respectively, and we would like the model to make use of them.
To set up the example case, you can use the following code:
```python
def get_current_temperature(location: str, unit: str = "celsius"):
    """Get current temperature at a location.

    Args:
        location: The location to get the temperature for, in the format "City, State, Country".
        unit: The unit to return the temperature in. Defaults to "celsius". (choices: ["celsius", "fahrenheit"])

    Returns:
        the temperature, the location, and the unit in a dict
    """
    return {
        "temperature": 26.1,
        "location": location,
        "unit": unit,
    }


def get_temperature_date(location: str, date: str, unit: str = "celsius"):
    """Get temperature at a location and date.

    Args:
        location: The location to get the temperature for, in the format "City, State, Country".
        date: The date to get the temperature for, in the format "Year-Month-Day".
        unit: The unit to return the temperature in. Defaults to "celsius". (choices: ["celsius", "fahrenheit"])

    Returns:
        the temperature, the location, the date and the unit in a dict
    """
    return {
        "temperature": 25.9,
        "location": location,
        "date": date,
        "unit": unit,
    }

def get_function_by_name(name):
    if name == "get_current_temperature":
        return get_current_temperature
    if name == "get_temperature_date":
        return get_temperature_date

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_current_temperature",
            "description": "Get current temperature at a location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": 'The location to get the temperature for, in the format "City, State, Country".',
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": 'The unit to return the temperature in. Defaults to "celsius".',
                    },
                },
                "required": ["location"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_temperature_date",
            "description": "Get temperature at a location and date.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": 'The location to get the temperature for, in the format "City, State, Country".',
                    },
                    "date": {
                        "type": "string",
                        "description": 'The date to get the temperature for, in the format "Year-Month-Day".',
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": 'The unit to return the temperature in. Defaults to "celsius".',
                    },
                },
                "required": ["location", "date"],
            },
        },
    },
]

MESSAGES = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant.\n\nCurrent Date: 2024-09-30"},
    {"role": "user", "content": "What's the temperature in San Francisco now? How about tomorrow?"},
]
```
:::
In particular, the tools should be described using JSON Schema and the messages should contain as much available information as possible.
You can find the explanations of the tools and messages below:
:::{dropdown} Example Tools
The tools should be described using the following JSON:
```json
[
  {
    "type": "function",
    "function": {
      "name": "get_current_temperature",
      "description": "Get current temperature at a location.",
      "parameters": {
        "type": "object",
        "properties": {
          "location": {
            "type": "string",
            "description": "The location to get the temperature for, in the format \"City, State, Country\"."
          },
          "unit": {
            "type": "string",
            "enum": [
              "celsius",
              "fahrenheit"
            ],
            "description": "The unit to return the temperature in. Defaults to \"celsius\"."
          }
        },
        "required": [
          "location"
        ]
      }
    }
  },
  {
    "type": "function",
    "function": {
      "name": "get_temperature_date",
      "description": "Get temperature at a location and date.",
      "parameters": {
        "type": "object",
        "properties": {
          "location": {
            "type": "string",
            "description": "The location to get the temperature for, in the format \"City, State, Country\"."
          },
          "date": {
            "type": "string",
            "description": "The date to get the temperature for, in the format \"Year-Month-Day\"."
          },
          "unit": {
            "type": "string",
            "enum": [
              "celsius",
              "fahrenheit"
            ],
            "description": "The unit to return the temperature in. Defaults to \"celsius\"."
          }
        },
        "required": [
          "location",
          "date"
        ]
      }
    }
  }
]
```
Each **tool** is a JSON object with two fields:

- `type`: a string specifying the type of the tool; currently only `"function"` is valid
- `function`: an object detailing the instructions for using the function

Each **function** description is a JSON object with three fields:

- `name`: a string indicating the name of the function
- `description`: a string describing what the function is used for
- `parameters`: [a JSON Schema](https://json-schema.org/learn/getting-started-step-by-step) that specifies the parameters the function accepts. Please refer to the linked documentation for how to compose a JSON Schema. Notable fields include `type`, `required`, and `enum`.
Most frameworks use the tool format and some may use the function format.
Which one to use should be obvious according to the naming.
:::
:::{dropdown} Example Messages
Our query is `What's the temperature in San Francisco now? How about tomorrow?`.
Since the model does not know what the current date is, let alone tomorrow, we should provide the date in the inputs.
Here, we decide to supply that information in the system message after the default system message `You are Qwen, created by Alibaba Cloud. You are a helpful assistant.`.
Alternatively, you could append the date to the user message in your application code.
```json
[
{"role":"system","content":"You are Qwen, created by Alibaba Cloud. You are a helpful assistant.\n\nCurrent Date: 2024-09-30"},
{"role":"user","content":"What's the temperature in San Francisco now? How about tomorrow?"}
]
```
:::
### Qwen-Agent
[Qwen-Agent](https://github.com/QwenLM/Qwen-Agent) is a Python agent framework for developing AI applications.
Although its intended use cases are higher-level than efficient inference, it does contain the **canonical implementation** of function calling for Qwen2.5.
It adds the function calling ability of Qwen2.5 on top of an OpenAI-compatible API through templates, in a way that is transparent to users.
{#note-official-template}
It's worth noting that since a lot can happen behind the scenes in application frameworks, the current official function calling implementation for Qwen2.5 is very flexible and goes beyond simple templating, making it hard to adapt to other frameworks that use less capable templating engines.
Before starting, let's make sure the latest library is installed:
```bash
pip install -U qwen-agent
```
For this guide, we are at version v0.0.10.
#### Preparing
Qwen-Agent can wrap an OpenAI-compatible API that does not support function calling.
You can serve such an API with most inference frameworks or obtain one from cloud providers like DashScope or Together.
Assuming there is an OpenAI-compatible API at `http://localhost:8000/v1`, Qwen-Agent provides a shortcut function `get_chat_model` to obtain a model inference class with function calling support:
```python
from qwen_agent.llm import get_chat_model

llm = get_chat_model({
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "model_server": "http://localhost:8000/v1",
    "api_key": "EMPTY",
})
```
In the above, `model_server` is the `api_base` commonly used in other OpenAI-compatible API clients.
It is advised to provide the `api_key` (though not as plaintext in the code), even if the API server does not check it, in which case you can set it to anything.
For model inputs, the common message structure for system, user, and assistant history should be used:
```python
messages = MESSAGES[:]
# [
#     {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant.\n\nCurrent Date: 2024-09-30"},
#     {"role": "user", "content": "What's the temperature in San Francisco now? How about tomorrow?"}
# ]
```
We add the current date to the system message so that the "tomorrow" in the user message is anchored.
It can also be added to the user message if one desires.
At the time of writing, Qwen-Agent works with functions instead of tools.
This requires a small change to our tool descriptions, that is, extracting the function fields:
```python
functions = [tool["function"] for tool in TOOLS]
```
#### Tool Calls and Tool Results
To interact with the model, the `chat` method should be used:
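A minimal sketch of such a call, using the `llm`, `messages`, and `functions` prepared above:
```python
# The chat method streams; iterate the generator to its end to get the final
# list of response messages for this turn.
for responses in llm.chat(
    messages=messages,
    functions=functions,
    extra_generate_cfg=dict(parallel_function_calls=True),
):
    pass
messages.extend(responses)
```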
In the above code, the `chat` method receives the `messages`, the `functions`, and an `extra_generate_cfg` parameter.
You can put sampling parameters, such as `temperature`, and `top_p`, in the `extra_generate_cfg`.
Here, we add to it a special control `parallel_function_calls` provided by Qwen-Agent.
As its name suggests, it will enable parallel function calls, which means that the model may generate multiple function calls for a single turn as it deems fit.
The `chat` method returns a generator of lists, each of which may contain multiple messages.
Since we enable `parallel_function_calls`, we should get two messages in the responses:
```python
[
    {'role': 'assistant', 'content': '', 'function_call': {'name': 'get_current_temperature', 'arguments': '{"location": "San Francisco, CA, USA", "unit": "celsius"}'}},
    {'role': 'assistant', 'content': '', 'function_call': {'name': 'get_temperature_date', 'arguments': '{"location": "San Francisco, CA, USA", "date": "2024-10-01", "unit": "celsius"}'}},
]
```
As we can see, Qwen-Agent attempts to parse the model generation into an easier-to-use structured format.
The details related to function calls are placed in the `function_call` field of the messages:

- `name`: a string representing the function to call
- `arguments`: a JSON-formatted string representing the arguments the function should be called with
Note that Qwen2.5-7B-Instruct is quite capable:
- It has followed the function instructions to add the state and the country to the location.
- It has correctly inferred the date of tomorrow and given it in the format required by the function.
Then comes the critical part -- checking and applying the function call:
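A sketch that matches the line-by-line notes below (it assumes `json` is imported and reuses `get_function_by_name` from the preparation code):
```python
for message in responses:
    if fn_call := message.get("function_call", None):
        fn_name: str = fn_call["name"]
        fn_args: dict = json.loads(fn_call["arguments"])

        fn_res: str = json.dumps(get_function_by_name(fn_name)(**fn_args))

        messages.append({
            "role": "function",
            "name": fn_name,
            "content": fn_res,
        })
```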
- line 1: We should iterate the function calls in the order the model generates them.
- line 2: We can check if a function call is needed as deemed by the model by checking the `function_call` field of the generated messages.
- line 3-4: The related details including the name and the arguments of the function can also be found there, which are `name` and `arguments` respectively.
- line 6: With the details, one should call the function and obtain the results.
Here, we assume there is a function named [`get_function_by_name`](#prepcode) to help us get the related function by its name.
- line 8-12: With the result obtained, add the function result to the messages as `content` and with `role` as `"function"`.
Now the messages are
```python
[
    {'role': 'system', 'content': 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.\n\nCurrent Date: 2024-09-30'},
    {'role': 'user', 'content': "What's the temperature in San Francisco now? How about tomorrow?"},
    {'role': 'assistant', 'content': '', 'function_call': {'name': 'get_current_temperature', 'arguments': '{"location": "San Francisco, CA, USA", "unit": "celsius"}'}},
    {'role': 'assistant', 'content': '', 'function_call': {'name': 'get_temperature_date', 'arguments': '{"location": "San Francisco, CA, USA", "date": "2024-10-01", "unit": "celsius"}'}},
    {'role': 'function', 'name': 'get_current_temperature', 'content': '{"temperature": 26.1, "location": "San Francisco, CA, USA", "unit": "celsius"}'},
    {'role': 'function', 'name': 'get_temperature_date', 'content': '{"temperature": 25.9, "location": "San Francisco, CA, USA", "date": "2024-10-01", "unit": "celsius"}'},
    {'role': 'assistant', 'content': 'Currently, the temperature in San Francisco is approximately 26.1°C. Tomorrow, on 2024-10-01, the temperature is forecasted to be around 25.9°C.'},
]
```
### Hugging Face transformers
Since function calling is based on prompt engineering and templates, `transformers` supports it with its tokenizer utilities, in particular, the `tokenizer.apply_chat_template` method, which hides the sophistication of constructing the model inputs, using the Jinja templating engine.
However, it means that users should handle the model output part on their own, which includes parsing the generated function call message.
The blog piece [_Tool Use, Unified_](https://huggingface.co/blog/unified-tool-use) is very helpful in understanding its design.
Be sure to take a look.
The tool use API has been available in `transformers` since v4.42.0.
Before starting, let's make sure a recent enough version is installed:
```bash
pip install "transformers>=4.42.0"
```
For this guide, we are at version v4.44.2.
#### Preparing
For Qwen2.5, the chat template in `tokenizer_config.json` has already included support for the Hermes-style tool use.
We simply need to load the model and the tokenizer:
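A minimal sketch of the loading step (the model name follows the rest of this guide):
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
```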
[^get_json_schema_note]: `transformers` will use `transformers.utils.get_json_schema` to generate the tool descriptions from Python functions.
There are some gotchas with `get_json_schema`, and it is advised to check [its doc \[v4.44.2\]](https://github.com/huggingface/transformers/blob/v4.44.2/src/transformers/utils/chat_template_utils.py#L183-L288) before relying on it.
- The function should use Python type hints for parameter types and have a Google-style docstring for the function description and parameter descriptions.
- Supported types are limited, since the types need to be mapped to JSON Schema.
In particular, `typing.Literal` is not supported.
You can instead add `(choices: ...)` at the end of a parameter description, which will be mapped to an `enum` type in JSON Schema.
Please be aware that all the returned results in the examples in the linked docstring are actually the content of the `function` field in the actual returned results.
#### Tool Calls and Tool Results
To construct the input sequence, we should use the `apply_chat_template` method and then let the model continue the texts:
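A sketch of this step, assuming the `model`, `tokenizer`, `tools`, and `messages` prepared above:
```python
text = tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
output_text = tokenizer.batch_decode(outputs)[0][len(text):]
```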
The generated text then needs to be parsed for tool calls; the `try_parse_tool_calls` helper used for this does not cover all possible scenarios and is thus prone to errors.
But it should suffice for the purpose of this guide.
:::{note}
The template in the `tokenizer_config.json` assumes that the generated content alongside tool calls is in the same message instead of separate assistant messages, e.g.,
```json
{
  "role": "assistant",
  "content": "To obtain the current temperature, I should call the function `get_current_temperature`.",
  "tool_calls": [
    {"type": "function", "function": {"name": "get_current_temperature", "arguments": {"location": "San Francisco, CA, USA", "unit": "celsius"}}}
  ]
}
```
instead of
```json
[
  {
    "role": "assistant",
    "content": "To obtain the current temperature, I should call the function `get_current_temperature`."
  },
  {
    "role": "assistant",
    "content": "",
    "tool_calls": [
      {"type": "function", "function": {"name": "get_current_temperature", "arguments": {"location": "San Francisco, CA, USA", "unit": "celsius"}}}
    ]
  }
]
```
This is handled roughly in `try_parse_tool_calls`, but keep it in mind if you are writing your own tool call parser.
The current temperature in San Francisco is approximately 26.1°C. Tomorrow, on October 1, 2024, the temperature is expected to be around 25.9°C.<|im_end|>
```
Add the result text as an assistant message and the final messages should be ready for further interaction:
### Ollama
Ollama is a set of tools for serving LLMs locally.
It also relies on its template implementation to support function calling.
Different from `transformers`, which is written in Python and uses the Jinja templating engine whose syntax is heavily inspired by Django and Python, Ollama, which is mostly written in Go, uses Go's [text/template](https://pkg.go.dev/text/template) package.
In addition, Ollama internally implements a helper function so that it can automatically parse tool calls generated in text into structured messages, if the format is supported.
You could check the [Tool support](https://ollama.com/blog/tool-support) blog post first.
Tool support has been available in Ollama since v0.3.0.
You can run the following to check the Ollama version:
```bash
ollama -v
```
If lower than expected, follow [the official instructions](https://ollama.com/download) to install the latest version.
In this guide, we will also use [ollama-python](https://github.com/ollama/ollama-python). Before starting, make sure it is available in your environment:
```bash
pip install ollama
```
For this guide, the `ollama` binary is at v0.3.9 and the `ollama` Python library is at v0.3.2.
#### Preparing
The message structure used in Ollama is the same as that in `transformers`, and the template in [Qwen2.5 Ollama models](https://ollama.com/library/qwen2.5) supports tool use.
The inputs are the same as those in [the preparation code](#prepcode):
```python
tools = TOOLS
messages = MESSAGES[:]
model_name = "qwen2.5:7b"
```
Note that you cannot pass Python functions directly as tools; each tool description has to be a `dict`.
#### Tool Calls and Tool Results
We can use the `ollama.chat` method to directly query the underlying API:
```python
import ollama

response = ollama.chat(
    model=model_name,
    messages=messages,
    tools=tools,
)
```
The main fields in the response could be:
```python
{
    'model': 'qwen2.5:7b',
    'message': {
        'role': 'assistant',
        'content': '',
        'tool_calls': [
            {'function': {'name': 'get_current_temperature', 'arguments': {'location': 'San Francisco, CA, USA'}}},
            {'function': {'name': 'get_temperature_date', 'arguments': {'date': '2024-10-01', 'location': 'San Francisco, CA, USA'}}},
        ],
    },
}
```
Ollama's tool call parser has succeeded in parsing the tool calls.
If not, you may refine [the `try_parse_tool_calls` function above](#parse-function).
Then, we can obtain the tool results and add them to the messages.
The following is basically the same as with `transformers`:
{'role': 'assistant', 'content': 'The current temperature in San Francisco is approximately 26.1°C. For tomorrow, October 1st, 2024, the forecasted temperature will be around 25.9°C.'}
```
(heading-target)=
### vLLM
vLLM is a fast and easy-to-use library for LLM inference and serving.
It uses the tokenizer from `transformers` to format the input, so we should have no trouble preparing the input.
In addition, vLLM also implements helper functions so that generated tool calls can be parsed automatically if the format is supported.
Tool support has been available in `vllm` since v0.6.0.
Be sure to install a version that supports tool use.
For more information, check the [vLLM documentation](https://docs.vllm.ai/en/stable/serving/openai_compatible_server.html#tool-calling-in-the-chat-completion-api).
For this guide, we are at version v0.6.1.post2.
We will use the OpenAI-Compatible API by `vllm` with the API client from the `openai` Python library.
#### Preparing
For Qwen2.5, the chat template in `tokenizer_config.json` has already included support for the Hermes-style tool use.
We simply need to start an OpenAI-compatible API with vLLM:
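A sketch of such a command (the tool-related flags are assumptions based on vLLM's Hermes-style tool call support):
```shell
vllm serve Qwen/Qwen2.5-7B-Instruct --enable-auto-tool-choice --tool-call-parser hermes
```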
{'id': 'chatcmpl-tool-924d705adb044ff88e0ef3afdd155f15', 'function': {'arguments': '{"location": "San Francisco, CA, USA"}', 'name': 'get_current_temperature'}, 'type': 'function'},
{'id': 'chatcmpl-tool-7e30313081944b11b6e5ebfd02e8e501', 'function': {'arguments': '{"location": "San Francisco, CA, USA", "date": "2024-10-01"}', 'name': 'get_temperature_date'}, 'type': 'function'},
]},
{'role': 'tool', 'content': '{"temperature": 26.1, "location": "San Francisco, CA, USA", "unit": "celsius"}', 'tool_call_id': 'chatcmpl-tool-924d705adb044ff88e0ef3afdd155f15'},
| Call Format | Single assistant message with `tool_calls` | Single assistant message with `tool_calls` | Single assistant message with `tool_calls` | Single assistant message with `tool_calls` | Multiple assistant messages with `function_call` |
| Call Result Format | Multiple tool messages with `content` | Multiple tool messages with `content` | Multiple tool messages with `content` | Multiple tool messages with `content` | Multiple function messages with `content` |
There are some details not shown in the above table:
- OpenAI API comes with Python, Node.js, Go, and .NET SDKs. It also follows the OpenAPI standard.
- Ollama comes with Python and Node.js SDKs. It has OpenAI-compatible API at a different base url that can be accessed using OpenAI API SDK.
- Qwen-Agent as an application framework can call the tools automatically for you, which is introduced in [the Qwen-Agent guide](./qwen_agent).
In addition, there are more on the model side of function calling, which means you may need to consider more things in production code:
- **Accuracy of function calling**:
When it comes to evaluating the accuracy of function calling, there are two aspects:
(a) whether the correct functions (including no functions) are selected and
(b) whether the correct function arguments are generated.
It is not always the case that Qwen2.5 will be accurate.
Function calling can involve knowledge that is deep and domain-specific.
Sometimes, it doesn't fully understand the function and selects the wrong one by mistake.
Sometimes, it can fall into a loop and require calling the same function again and again.
Sometimes, it will fabricate required function arguments instead of asking the user for input.
To improve the function calling accuracy, it is advised to first try prompt engineering:
does a more detailed function description help?
can we provide instructions and examples to the model in the system message?
If not, finetuning on your own data could also improve performance.
- **Protocol consistency**:
Even with the proper function calling template, the protocol may break.
The model may generate extra text in addition to the tool calls, e.g., explanations.
The generated tool call may not be a valid JSON-formatted string but rather a representation of a Python dict.
The generated tool call may be valid JSON but not conform to the provided JSON Schema.
For those kinds of issues, while some of them could be addressed with prompt engineering, some are caused by the nature of LLMs and can be hard to resolve in a general manner by LLMs themselves.
While we strive to improve Qwen2.5 in this regard, edge cases are unlikely to be eliminated completely.
## Function Calling Templates
The template design for function calling often includes the following aspects:
- How to describe the functions to the model, so that the model understands what they are and how to use them.
- How to prompt the model, so that it knows that functions can be used and in what format to generate the function calls.
- How to tell a function call apart from other content in the generated text, so that we can extract the calls from the generated texts and actually make the calls.
- How to incorporate the function results to the text, so that the model can tell them from its own generation and make connection among the calls and the results.
For experienced prompt engineers, it should be possible to make any LLM support function calling, using in-context learning techniques and with representative examples, though with varied accuracy and stability depending on how "zero-shot" the task at hand is.
### Starting from ReAct Prompting
For example, ReAct Prompting can be used to implement function calling with an extra element of planning:
- **Thought**: the overt reasoning path, analyzing the functions and the user query and saying it out "loud"
- **Action**: the function to use and the arguments with which the function should be called
- **Observation**: the results of the function
In fact, Qwen2 is versed in the following variant of ReAct Prompting (similar to LangChain ReAct), which makes the intermediate texts more structured:
```
Answer the following questions as best you can. You have access to the following tools:
{function_name}: Call this tool to interact with the {function_name_human_readable} API. What is the {function_name_human_readable} API useful for? {function_description} Parameters: {function_parameter_descriptions} {argument_formatting_instructions}
{function_name}: Call this tool to interact with the {function_name_human_readable} API. What is the {function_name_human_readable} API useful for? {function_description} Parameters: {function_parameter_descriptions} {argument_formatting_instructions}
Use the following format:
Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{function_name},{function_name}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can be repeated zero or more times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question
Begin!
Question: {query}
Thought: {some_text}
Action: {function_name}
Action Input: {function_arguments}
Observation: {function_results}
Final Answer: {response}
```
As you can see, there is no apparent user/assistant conversation structure in the template.
The model will simply continue the texts.
One should write the code to actively detect which step the model is at and in particular to add the observations in the process, until the Final Answer is generated.
However, as most programming interfaces accept the message structure, there should be some kind of adapter between the two.
[The ReAct Chat Agent](https://github.com/QwenLM/Qwen-Agent/blob/v0.0.10/qwen_agent/agents/react_chat.py) in Qwen-Agent facilitates this kind of conversion.
### Qwen2 Function Calling Template
As a step forward, the official Qwen2 function calling template is in the vein of the ReAct Prompting format but focuses more on
- differentiating the keywords like `Question`, `Thought`, `Action`, etc., from generation,
## When you need to call a tool, please insert the following command in your reply, which can be called zero or multiple times according to your needs:
✿FUNCTION✿: The tool to use, should be one of [{function_name},{function_name}]
✿ARGS✿: The input of the tool
✿RESULT✿: Tool results
✿RETURN✿: Reply based on tool results. Images need to be rendered as <|im_end|>
<|im_start|>user
{query}<|im_end|>
<|im_start|>assistant
✿FUNCTION✿: {function_name}
✿ARGS✿: {function_arguments}
✿RESULT✿: {function_result}
✿RETURN✿:{response}<|im_end|>
```
Let's first list the obvious differences:
- Keywords (`✿FUNCTION✿`, `✿ARGS✿`, etc.) seem rare in ordinary text and are more semantically related to function calling, but they are not special tokens yet.
- Thought is omitted. This could affect accuracy for some use cases.
- It uses the system-user-assistant format for multi-turn conversations, and the function calling prompting is moved to the system message.
How about adding controls for specialized usage?
The template actually has the following variants:
- Language: the above is for non-Chinese language; there is another template in Chinese.
- Parallel Calls: the above is for non-parallel calls; there is another template for parallel calls.
In the canonical implementation in Qwen-Agent, those switches are implemented in Python, according to the configuration and current input.
The actual text with _parallel calls_ should be like the following:
```text
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
Current Date: 2024-09-30
## Tools
You have access to the following tools:
### get_current_temperature
get_current_temperature: Get current temperature at a location. Parameters: {"type": "object", "properties": {"location": {"type": "string", "description": "The location to get the temperature for, in the format \"City, State, Country\"."}, "unit": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "The unit to return the temperature in. Defaults to \"celsius\"."}}, "required": ["location"]} Format the arguments as a JSON object.
### get_temperature_date
get_temperature_date: Get temperature at a location and date. Parameters: {"type": "object", "properties": {"location": {"type": "string", "description": "The location to get the temperature for, in the format \"City, State, Country\"."}, "date": {"type": "string", "description": "The date to get the temperature for, in the format \"Year-Month-Day\"."}, "unit": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "The unit to return the temperature in. Defaults to \"celsius\"."}}, "required": ["location", "date"]} Format the arguments as a JSON object.
## Insert the following command in your reply when you need to call N tools in parallel:
✿FUNCTION✿: The name of tool 1, should be one of [get_current_temperature,get_temperature_date]
✿ARGS✿: The input of tool 1
✿FUNCTION✿: The name of tool 2
✿ARGS✿: The input of tool 2
...
✿FUNCTION✿: The name of tool N
✿ARGS✿: The input of tool N
✿RESULT✿: The result of tool 1
✿RESULT✿: The result of tool 2
...
✿RESULT✿: The result of tool N
✿RETURN✿: Reply based on tool results. Images need to be rendered as <|im_end|>
<|im_start|>user
What's the temperature in San Francisco now? How about tomorrow?<|im_end|>
<|im_start|>assistant
✿FUNCTION✿: get_current_temperature
✿ARGS✿: {"location": "San Francisco, CA, USA"}
✿FUNCTION✿: get_temperature_date
✿ARGS✿: {"location": "San Francisco, CA, USA", "date": "2024-10-01"}
✿RESULT✿: {"temperature": 26.1, "location": "San Francisco, CA, USA", "unit": "celsius"}
✿RETURN✿: The current temperature in San Francisco is approximately 26.1°C. For tomorrow, October 1st, 2024, the forecasted temperature will be around 25.9°C.<|im_end|>
```
This template is hard to adapt to other frameworks that use less capable templating engines.
But it is doable, at least partially, for Jinja, which is Python-oriented after all.
We didn't adopt it because using this template in `transformers` would lead to more changes in the inference usage, which may not be familiar to beginners.
For the interested, you can find the Jinja template and key points on usage below:
:::{dropdown} Qwen2 Function Calling Jinja Template
```jinja
{%- if messages[0]["role"] == "system" %}
    {%- set system_message = messages[0]["content"] %}
    {%- set loop_messages = messages[1:] %}
{%- else %}
    {%- set system_message = "You are a helpful assistant." %}
    {%- set loop_messages = messages %}
{%- endif %}
{%- if parallel_tool_calls is undefined %}
    {%- set parallel_tool_calls = false %}
{%- endif %}
{%- if language is undefined or language != "zh" %}
    {%- set language = "en" %}
{%- endif %}
{{- "<|im_start|>system\n" + system_message | trim }}
{%- if tools is defined %}
    {{- "\n\n# 工具\n\n## 你拥有如下工具:\n\n" if language == "zh" else "\n\n## Tools\n\nYou have access to the following tools:\n\n" }}
    {{- "### " + function.name + "\n\n" + function.name + ": " + function.description + (" 输入参数:" if language == "zh" else " Parameters: ") + function.parameters | tojson + (" 此工具的输入应为JSON对象。\n\n" if language == "zh" else " Format the arguments as a JSON object.\n\n") }}
    {{- "## Insert the following command in your reply when you need to call N tools in parallel:\n\n✿FUNCTION✿: The name of tool 1, should be one of [" + function_names + "]\n✿ARGS✿: The input of tool 1\n✿FUNCTION✿: The name of tool 2\n✿ARGS✿: The input of tool 2\n...\n✿FUNCTION✿: The name of tool N\n✿ARGS✿: The input of tool N\n✿RESULT✿: The result of tool 1\n✿RESULT✿: The result of tool 2\n...\n✿RESULT✿: The result of tool N\n✿RETURN✿: Reply based on tool results. Images need to be rendered as " }}
    {{- "## When you need to call a tool, please insert the following command in your reply, which can be called zero or multiple times according to your needs:\n\n✿FUNCTION✿: The tool to use, should be one of [" + function_names + "]\n✿ARGS✿: The input of the tool\n✿RESULT✿: Tool results\n✿RETURN✿: Reply based on tool results. Images need to be rendered as " }}
- Switches can be enabled by passing them to the `apply_chat_template` method, e.g., `tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, parallel_tool_calls=True, language="zh", tokenize=False)`. By default, it is for English non-parallel function calling.
- The tool arguments should be a Python `dict` instead of a JSON-formatted object `str`.
- Since the generation needs to be stopped at `✿RESULT✿` (or else the model will generate fabricated tool results), we should add it to `stop_strings` in the `generation_config` (see the sketch after this list):
- As a result of using `stop_strings`, you need to pass the tokenizer to `model.generate`, as in `model.generate(**inputs, tokenizer=tokenizer, max_new_tokens=512)`.
- `response`, i.e., the model generation based on the tool calls and tool results, may contain a leading space. You should not strip it for the model; it results from the tokenization and the template design.
- The `try_parse_tool_calls` function should also be modified accordingly.
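A minimal sketch of the `stop_strings` setup mentioned above (assuming `model`, `tokenizer`, and `inputs` from the earlier steps):
```python
# Stop generation before the model starts fabricating tool results.
model.generation_config.stop_strings = ["✿RESULT✿"]
# `stop_strings` requires the tokenizer to be passed to `generate`.
outputs = model.generate(**inputs, tokenizer=tokenizer, max_new_tokens=512)
```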
:::
### Qwen2.5 Function Calling Templates
For `transformers` and Ollama, we have also used templates that are easier to implement with Jinja or Go.
They are variants of [the Nous Research's Hermes function calling template](https://github.com/NousResearch/Hermes-Function-Calling#prompt-format-for-function-calling).
The Jinja template and the Go template should produce basically the same results.
The final text should look like the following:
```text
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
Current Date: 2024-09-30
# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "get_current_temperature", "description": "Get current temperature at a location.", "parameters": {"type": "object", "properties": {"location": {"type": "string", "description": "The location to get the temperature for, in the format \"City, State, Country\"."}, "unit": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "The unit to return the temperature in. Defaults to \"celsius\"."}}, "required": ["location"]}}}
{"type": "function", "function": {"name": "get_temperature_date", "description": "Get temperature at a location and date.", "parameters": {"type": "object", "properties": {"location": {"type": "string", "description": "The location to get the temperature for, in the format \"City, State, Country\"."}, "date": {"type": "string", "description": "The date to get the temperature for, in the format \"Year-Month-Day\"."}, "unit": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "The unit to return the temperature in. Defaults to \"celsius\"."}}, "required": ["location", "date"]}}}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
The current temperature in San Francisco is approximately 26.1°C. Tomorrow, on October 1, 2024, the temperature is expected to be around 25.9°C.<|im_end|>
```
While the text may seem different from the previous one, the basic prompting structure is still the same.
There are just more structural tags and more JSON-formatted strings.
---
There is one thing we haven't talked about yet: how functions should be described to the LLMs.
In short, you could describe them as you would normally describe them in an API documentation, as long as you can effectively parse, validate, and execute the tool calls generated by the models.
The format with JSON Schema appears to be a valid and common choice.
## Finally
In whichever way you choose to use function calling with Qwen2.5, keep in mind that the limitations and the perks of prompt engineering apply:
- It is not guaranteed that the model generation will always follow the protocol, even with proper prompting or templates.
This is especially true for templates that are more complex and rely more on the model itself to think and stay on track, compared to simpler ones that rely on the template and the use of control or special tokens.
The latter, of course, requires some kind of training.
In production code, be prepared with countermeasures or rectifications in case the protocol breaks.
- If the generation is not up to expectation in certain scenarios, you can refine the template to add more instructions or constraints.
While the templates mentioned here are general enough, they may not be the best, the most specific, or the most concise for your use cases.
The ultimate solution is fine-tuning using your own data.
Qwen (Chinese: 通义千问; pinyin: _Tongyi Qianwen_) is the large language model and large multimodal model series of the Qwen Team, Alibaba Group.
Qwen is capable of natural language understanding, text generation, vision understanding, audio understanding, tool use, role play, playing as AI agent, etc.
Both language models and multimodal models are pre-trained on large-scale multilingual and multimodal data and post-trained on quality data for aligning to human preferences.
There are proprietary versions and open-weight versions.
The proprietary versions include
- Qwen: the language models
- Qwen Max
- Qwen Plus
- Qwen Turbo
- Qwen-VL: the vision-language models
- Qwen-VL Max
- Qwen-VL Plus
- Qwen-VL OCR
- Qwen-Audio: the audio-language models
- Qwen-Audio Turbo
- Qwen-Audio ASR
You can learn more about them at Alibaba Cloud Model Studio ([China Site](https://help.aliyun.com/zh/model-studio/getting-started/models#9f8890ce29g5u)\[zh\], [International Site](https://www.alibabacloud.com/en/product/modelstudio)).
The spectrum for the open-weight models spans over
- Qwen: the language models
- [Qwen](https://github.com/QwenLM/Qwen): 1.8B, 7B, 14B, and 72B models
- [Qwen2.5-Math-PRM](https://github.com/QwenLM/Qwen2.5-Math): 7B and 72B models
**In this document, our focus is Qwen, the language models.**
## Causal Language Models
Causal language models, also known as autoregressive language models or decoder-only language models, are a type of machine learning model designed to predict the next token in a sequence based on the preceding tokens.
In other words, they generate text one token at a time, using the previously generated tokens as context.
The "causal" aspect refers to the fact that the model only considers the past context (the already generated tokens) when predicting the next token, not any future tokens.
Causal language models are widely used for various natural language processing tasks involving text completion and generation.
They have been particularly successful in generating coherent and contextually relevant text, making them a cornerstone of modern natural language understanding and generation systems.
**Takeaway: Qwen models are causal language models suitable for text completion.**
:::{dropdown} Learn more about language models
There are three main kinds of models that are commonly referred to as language models in deep learning:
- Sequence-to-sequence models: T5 and the likes
Sequence-to-sequence models use both an encoder to capture the entire input sequence and a decoder to generate an output sequence.
They are widely used for tasks like machine translation, text summarization, etc.
- Bidirectional models or encoder-only models: BERT and the likes
Bidirectional models can access both past and future context in a sequence during training.
They cannot generate sequential outputs in real-time due to the need for future context.
They are widely used as embedding models and subsequently used for text classification.
- Causal language models or decoder-only models: GPT and the likes
Causal language models operate unidirectionally in a strictly forward direction, predicting each subsequent word based only on the previous words in the sequence.
This unidirectional nature ensures that the model's predictions do not rely on future context, making them suitable for tasks like text completion and generation.
:::
### Pre-training & Base models
Base language models are foundational models trained on extensive corpora of text to predict the next word in a sequence.
Their main goal is to capture the statistical patterns and structures of language, enabling them to generate coherent and contextually relevant text.
These models are versatile and can be adapted to various natural language processing tasks through fine-tuning.
While adept at producing fluent text, they may require in-context learning or additional training to follow specific instructions or perform complex reasoning tasks effectively.
For Qwen models, the base models are those without "-Instruct" indicators, such as Qwen2.5-7B and Qwen2.5-72B.
**Takeaway: Use base models for in-context learning, downstream fine-tuning, etc.**
### Post-training & Instruction-tuned models
Instruction-tuned language models are specialized models designed to understand and execute specific instructions in conversational styles.
These models are fine-tuned to interpret user commands accurately and can perform tasks such as summarization, translation, and question answering with improved accuracy and consistency.
Unlike base models, which are trained on large corpora of text, instruction-tuned models undergo additional training using datasets that contain examples of instructions and their desired outcomes, often in multiple turns.
This kind of training makes them ideal for applications requiring targeted functionalities while maintaining the ability to generate fluent and coherent text.
For Qwen models, the instruction-tuned models are those with the "-Instruct" suffix, such as Qwen2.5-7B-Instruct and Qwen2.5-72B-Instruct. [^instruct-chat]
**Takeaway: Use instruction-tuned models for conducting tasks in conversations, downstream fine-tuning, etc.**
[^instruct-chat]: Previously, they were known as the chat models and carried the "-Chat" suffix. Starting from Qwen2, the naming was changed to follow common practice. For Qwen, "-Instruct" and "-Chat" should be regarded as synonymous.
## Tokens & Tokenization
Tokens represent the fundamental units that models process and generate.
They can represent texts in human languages (regular tokens) or represent specific functionality like keywords in programming languages (control tokens [^special]).
Typically, a tokenizer is used to split text into regular tokens, which can be words, subwords, or characters depending on the specific tokenization scheme employed, and furnish the token sequence with control tokens as needed.
The vocabulary size, or the total number of unique tokens a model recognizes, significantly impacts its performance and versatility.
Larger language models often use sophisticated tokenization methods to handle the vast diversity of human language while keeping the vocabulary size manageable.
Qwen uses a relatively large vocabulary of 151,646 tokens in total.
[^special]: Control tokens can also be called special tokens. However, the meaning of special tokens needs to be interpreted based on the context: the set of special tokens may contain extra regular tokens.
**Takeaway: Tokenization method and vocabulary size are important.**
### Byte-level Byte Pair Encoding
Qwen adopts a subword tokenization method called Byte Pair Encoding (BPE), which attempts to learn the composition of tokens that can represent the text with the fewest tokens.
For example, the string " tokenization" is decomposed as " token" and "ization" (note that the space is part of the token).
In particular, the tokenization of Qwen ensures that there are no unknown words and all texts can be transformed into token sequences.
There are 151,643 tokens as a result of BPE in the vocabulary of Qwen, which is a large vocabulary efficient for diverse languages.
As a rule of thumb, 1 token is 3~4 characters for English texts and 1.5~1.8 characters for Chinese texts.
**Takeaway: Qwen processes texts in subwords and there are no unknown words.**
:::{dropdown} Learn more about tokenization in Qwen
Qwen uses byte-level BPE (BBPE) on UTF-8 encoded texts.
It starts by treating each byte as a token and then iteratively merges the most frequent pairs of tokens occurring in the texts into larger tokens until the desired vocabulary size is met.
In byte-level BPE, a minimum of 256 tokens is needed to tokenize every piece of text and avoid the out-of-vocabulary (OOV) problem.
In comparison, character-level BPE needs every Unicode character in its vocabulary to avoid OOV and the Unicode Standard contains 154,998 characters as of Unicode Version 16.0.
One limitation to keep in mind for byte-level BPE is that the individual tokens in the vocabulary may not appear semantically meaningful or even be valid UTF-8 byte sequences; in certain respects, they should be viewed as a text compression scheme.
:::
### Control Tokens & Chat Template
Control tokens and chat templates both serve as mechanisms to guide the model's behavior and outputs.
Control tokens are special tokens inserted into the sequence that signify meta information.
For example, in pre-training, multiple documents may be packed into a single sequence.
For Qwen, the control token "<|endoftext|>" is inserted after each document to signify that the document has ended and a new document will follow.
Chat templates provide a structured format for conversational interactions, where predefined placeholders or prompts are used to elicit responses from the model that adhere to a desired dialogue flow or context.
Different models may use different kinds of chat templates to format the conversations.
It is crucial to use the designated one to ensure precise control over the LLM's generation process.
Qwen uses the following format (ChatML[^chatml]), making use of control tokens to format each turn in the conversations:
```text
<|im_start|>{{role}}
{{content}}<|im_end|>
```
The user input takes the role of `user` and the model generation takes the role of `assistant`.
Qwen also supports a meta message that instructs the model to perform specific actions or generate text with certain characteristics, such as altering tone, style, or content. This takes the role of `system`, and the content defaults to "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."
The following is a full example:
```text
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
hello<|im_end|>
<|im_start|>assistant
Hello! How can I assist you today? Let me know if you have any questions or need help with something. I'm here to help!<|im_end|>
<|im_start|>user
Explain large language models like I'm 5.<|im_end|>
<|im_start|>assistant
Sure, I'd be happy to explain large language models in a simple way!
Imagine that you have a friend who is really good at playing with words. This friend has read so many books and talked to so many people that they know how to use words in all sorts of different ways. They can tell stories, answer questions, and even make up new words if they need to.
Now imagine that this friend is actually a computer program, called a "large language model". It's been trained on lots and lots of text, like books, articles, and conversations, so it knows how to use words just like your word-loving friend does. When you ask the model a question or give it a task, it uses all the knowledge it's learned to come up with a response that makes sense.
Just like your friend might sometimes make mistakes or say things in a funny way, the large language model isn't perfect either. But it's still really impressive, because it can understand and generate human language in a way that was once thought impossible for machines!<|im_end|><|endoftext|>
```
Starting from Qwen2.5, the Qwen model family, including multimodal and specialized models, uses a unified vocabulary, which contains control tokens from all subfamilies.
There are 22 control tokens in the vocabulary of Qwen2.5, bringing the total vocabulary size to 151,665:
- 1 general: `<|endoftext|>`
- 2 for chat: `<|im_start|>` and `<|im_end|>`
- 2 for tool use: `<tool_call>` and `</tool_call>`
- 11 for vision
- 6 for coding
**Takeaway: Qwen uses ChatML with control tokens for chat template.**
[^chatml]: For historical reference only: ChatML was first described by the OpenAI Python SDK. The last available version is [this](https://github.com/openai/openai-python/blob/v0.28.1/chatml.md). Please also be aware that that document lists use cases intended for OpenAI models. For Qwen2.5 models, please only use ChatML as described in our guide.
## Length Limit
As Qwen models are causal language models, in theory there is only one length limit: that of the entire sequence.
However, packing is often used in training, and each sequence may contain multiple individual pieces of text.
**How long the model can generate or complete ultimately depends on the use case and, in that case, on how long each document (for pre-training) or each turn (for post-training) is in training.**
For Qwen2.5, the packed sequence length in training is 32,768 tokens.[^yarn]
The maximum document length in pre-training is this length.
The maximum message length for user and assistant is different in post-training.
In general, the assistant message could be up to 8192 tokens.
[^yarn]: The sequence length can be extended to 131,072 tokens for the Qwen2.5-7B, Qwen2.5-14B, Qwen2.5-32B, and Qwen2.5-72B models with YaRN.
Please refer to the model card for how to enable YaRN in vLLM.
**Takeaway: Qwen2.5 models can process texts of 32K or 128K tokens and up to 8K tokens can be assistant output.**
We provide examples of [Hugging Face Transformers](https://github.com/huggingface/transformers) as well as [ModelScope](https://github.com/modelscope/modelscope), and [vLLM](https://github.com/vllm-project/vllm) for deployment.
You can find Qwen3 models in [the Qwen3 collection](https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f) at HuggingFace Hub and [the Qwen3 collection](https://www.modelscope.cn/collections/Qwen3-9743180bdc6b48) at ModelScope.
## Transformers
To get a quick start with Qwen3, you can try the inference with `transformers` first.
Make sure that you have installed `transformers>=4.51.0`.
We advise you to use Python 3.10 or higher, and PyTorch 2.6 or higher.
The following is a very simple code snippet showing how to run Qwen3-8B:
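A minimal sketch of such a snippet might look like the following (the prompt is illustrative, and the handling of the thinking content is simplified):
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # the default; set to False for the hard switch described below
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=32768)
output_ids = generated[0][len(inputs.input_ids[0]):]
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```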
Qwen3 will think before responding, similar to QwQ models.
This means the model will use its reasoning abilities to enhance the quality of generated responses.
The model will first generate thinking content wrapped in a `<think>...</think>` block, followed by the final response.
- Hard Switch:
To strictly disable the model's thinking behavior, aligning its functionality with the previous Qwen2.5-Instruct models, you can set `enable_thinking=False` when formatting the text.
* Sampling configuration: for vLLM, set the following sampling parameters: `SamplingParams(temperature=0.7, top_p=0.8, top_k=20, repetition_penalty=1, presence_penalty=0, frequency_penalty=0, max_tokens=out_length)`.
Qwen is the large language model and large multimodal model series of the Qwen Team, Alibaba Group. Both language models and multimodal models are pretrained on large-scale multilingual and multimodal data and post-trained on quality data for aligning to human preferences.
Qwen is capable of natural language understanding, text generation, vision understanding, audio understanding, tool use, role play, playing as AI agent, etc.
The latest version, Qwen3, has the following features:
- **Dense and Mixture-of-Experts (MoE) models**, available in 0.6B, 1.7B, 4B, 8B, 14B, 32B and 30B-A3B, 235B-A22B.
- **Seamless switching between thinking mode** (for complex logical reasoning, math, and coding) and **non-thinking mode** (for efficient, general-purpose chat) **within a single model**, ensuring optimal performance across various scenarios.
- **Significant enhancement in reasoning capabilities**, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- **Superior human preference alignment**, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
- **Expertise in agent capabilities**, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- **Support of 100+ languages and dialects** with strong capabilities for **multilingual instruction following** and **translation**.
Join our community on `Discord <https://discord.gg/yPEP2vHTu4>`__ and in our `WeChat <https://github.com/QwenLM/Qwen/blob/main/assets/wechat.png>`__ group. We are looking forward to seeing you there!
- **Device Placement**: `device_map="auto"` will automatically load the model parameters onto multiple devices, if available.
It relies on the `accelerate` package.
If you would like to use a single device, you can pass `device` instead of `device_map`.
`device=-1` or `device="cpu"` indicates using CPU, `device="cuda"` indicates using the current GPU, and `device="cuda:1"` or `device=1` indicates using the second GPU.
Do not use `device_map` and `device` at the same time!
- **Compute Precision**: `torch_dtype="auto"` will automatically determine the data type to use, based on the original precision of the checkpoint and the precision supported by your device.
For modern devices, the precision determined will be `bfloat16`.
If you don't pass `torch_dtype="auto"`, the default data type is `float32`, which will take double the memory and be slower in computation.
Calls to the text generation pipeline will use the generation configuration from the model file, e.g., `generation_config.json`.
That configuration can be overridden by passing arguments directly to the call.
If you would like a more structured assistant message format, you can use the following function to extract the thinking content into a field named `reasoning_content`, which is similar to the format used by vLLM, SGLang, etc.
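A minimal sketch of such a helper, assuming the thinking content is wrapped in a `<think>...</think>` block (the function name is ours, not a library API):
```python
import re

def to_structured_message(text: str) -> dict:
    """Split a raw assistant generation into reasoning_content and content."""
    match = re.match(r"\s*<think>(.*?)</think>\s*(.*)", text, flags=re.DOTALL)
    if match:
        return {
            "role": "assistant",
            "reasoning_content": match.group(1).strip(),
            "content": match.group(2),
        }
    return {"role": "assistant", "reasoning_content": None, "content": text}

# Usage: message = to_structured_message(tokenizer.decode(output_ids, skip_special_tokens=True))
```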
FP8 computation is supported on NVIDIA GPUs with compute capability 8.9 or higher, that is, Ada Lovelace, Hopper, and later GPUs.
For better performance, make sure `triton` and a CUDA compiler compatible with the CUDA version of `torch` in your environment are installed.
:::
:::{important}
As of 4.51.0, there are issues with Transformers when running those checkpoints **across GPUs**.
The following method could be used to work around those issues:
- Set the environment variable `CUDA_LAUNCH_BLOCKING=1` before running the script; or
- Uncomment [this line](https://github.com/huggingface/transformers/blob/0720e206c6ba28887e4d60ef60a6a089f6c1cc76/src/transformers/integrations/finegrained_fp8.py#L340) in your local installation of `transformers`.
:::
## Enabling Long Context
The maximum context length in pre-training for Qwen3 models is 32,768 tokens.
It can be extended to 131,072 tokens with RoPE scaling techniques.
We have validated the performance with YaRN.
Transformers supports YaRN, which can be enabled either by modifying the model files or overriding the default arguments when loading the model.
- Modifying the model files: in the `config.json` file, add the `rope_scaling` fields:
```json
{
    ...,
    "rope_scaling": {
        "type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 32768
    }
}
```
- Overriding the default arguments:
```python
from transformers import pipeline

model_name_or_path = "Qwen/Qwen3-8B"

generator = pipeline(
    "text-generation",
    model_name_or_path,
    torch_dtype="auto",
    device_map="auto",
    model_kwargs={
        "rope_scaling": {
            "type": "yarn",
            "factor": 4.0,
            "original_max_position_embeddings": 32768,
        }
    },
)
```
:::{note}
Transformers implements static YaRN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts.**
We advise adding the `rope_scaling` configuration only when processing long contexts is required.
It is also recommended to modify the `factor` as needed. For example, if the typical context length for your application is 65,536 tokens, it would be better to set `factor` as 2.0.
:::
## Streaming Generation
With the help of `TextStreamer`, you can switch your chat with Qwen3 to streaming mode.
It will print the response to the console or terminal as it is being generated.
Besides using `TextStreamer`, we can also use `TextIteratorStreamer` which stores print-ready text in a queue, to be used by a downstream application as an iterator:
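A minimal sketch of this pattern, assuming `model`, `tokenizer`, and `inputs` are prepared as in the snippets above:
```python
from threading import Thread
from transformers import TextIteratorStreamer

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Run generation in a background thread so that the main thread can consume the stream.
generation_kwargs = dict(**inputs, streamer=streamer, max_new_tokens=32768)
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

for new_text in streamer:
    print(new_text, end="", flush=True)
thread.join()
```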
You may find that distributed inference with Transformers is not as fast as you would expect.
Transformers with `device_map="auto"` does not apply tensor parallelism; it only uses one GPU at a time.
For Transformers with tensor parallelism, please refer to [its documentation](https://huggingface.co/docs/transformers/v4.51.3/en/perf_infer_gpu_multi).
For quantized models, one of our recommendations is the usage of [AWQ](https://arxiv.org/abs/2306.00978) with [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
**AWQ** refers to Activation-aware Weight Quantization, a hardware-friendly approach for LLM low-bit weight-only quantization.
**AutoAWQ** is an easy-to-use Python library for 4-bit quantized models.
AutoAWQ speeds up models by 3x and reduces memory requirements by 3x compared to FP16.
AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs.
In this document, we show you how to use the quantized model with Hugging Face `transformers` and also how to quantize your own model.
## Usage of AWQ Models with Hugging Face transformers
Now that `transformers` officially supports AutoAWQ, you can directly use the quantized model with `transformers`.
The following is a very simple code snippet showing how to run the quantized model `Qwen2.5-7B-Instruct-AWQ`:
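A minimal sketch, assuming the standard chat-template workflow (the prompt and generation length are illustrative):
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct-AWQ"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": "Give me a short introduction to large language models."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(generated[0][len(inputs.input_ids[0]):], skip_special_tokens=True))
```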
vLLM supports AWQ, which means that you can directly use our provided AWQ models, or those quantized with `AutoAWQ`, with vLLM.
We recommend using the latest version of vLLM (`vllm>=0.6.1`) which brings performance improvements to AWQ models; otherwise, the performance might not be well-optimized.
Actually, the usage is the same as the basic usage of vLLM.
We provide a simple example of how to launch OpenAI-API compatible API with vLLM and `Qwen2.5-7B-Instruct-AWQ`:
Run the following in a shell to start an OpenAI-compatible API service:
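For example, a command along these lines should work with recent vLLM versions (the port shown is the default):
```bash
vllm serve Qwen/Qwen2.5-7B-Instruct-AWQ --port 8000
```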
[GPTQ](https://arxiv.org/abs/2210.17323) is a quantization method for GPT-like LLMs, which uses one-shot weight quantization based on approximate second-order information.
In this document, we show you how to use the quantized model with Hugging Face `transformers` and also how to quantize your own model with [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ).
## Usage of GPTQ Models with Hugging Face transformers
:::{note}
To use the official Qwen2.5 GPTQ models with `transformers`, please ensure that `optimum>=1.20.0` and compatible versions of `transformers` and `auto_gptq` are installed.
You can do that by
```bash
pip install -U "optimum>=1.20.0"
```
:::
Now that `transformers` officially supports AutoGPTQ, you can directly use the quantized model with `transformers`.
For each size of Qwen2.5, we provide both Int4 and Int8 GPTQ quantized models.
The following is a very simple code snippet showing how to run `Qwen2.5-7B-Instruct-GPTQ-Int4`:
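A minimal sketch using the text-generation pipeline (the prompt is illustrative):
```python
from transformers import pipeline

generator = pipeline(
    "text-generation",
    "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
    torch_dtype="auto",
    device_map="auto",
)

messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
result = generator(messages, max_new_tokens=512)
print(result[0]["generated_text"][-1]["content"])
```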
:Framework: vLLM, AutoGPTQ (including Hugging Face transformers)
:Description: Generation cannot stop properly. It continues past where it should stop, and then repeated text, whether a single character, a phrase, or whole paragraphs, is generated.
:Workaround: The following workarounds could be considered
1. Using the original model in 16-bit floating point
2. Using the AWQ variants or llama.cpp-based models for reduced chances of abnormal generation
### Qwen2.5-32B-Instruct-GPTQ-Int4 broken with vLLM on multiple GPUs
:Model: Qwen2.5-32B-Instruct-GPTQ-Int4
:Framework: vLLM
:Description: Deployment on multiple GPUs and only garbled text like `!!!!!!!!!!!!!!!!!!` could be generated.
:Workaround: Each of the following workarounds could be considered
1. Using the AWQ or GPTQ-Int8 variants
2. Using a single GPU
3. Using Hugging Face `transformers` if latency and throughput are not major concerns
## Troubleshooting
:::{dropdown} With `transformers` and `auto_gptq`, the logs suggest `CUDA extension not installed.` and the inference is slow.
`auto_gptq` fails to find a fused CUDA kernel compatible with your environment and falls back to a plain implementation.
Follow its [installation guide](https://github.com/AutoGPTQ/AutoGPTQ/blob/main/docs/INSTALLATION.md) to install a pre-built wheel or try installing `auto_gptq` from source.
:::
:::{dropdown} Self-quantized Qwen2.5-72B-Instruct-GPTQ with `vllm`, `ValueError: ... must be divisible by ...` is raised. The intermediate size of the self-quantized model is different from the official Qwen2.5-72B-Instruct-GPTQ models.
After quantization, the sizes of the quantized weights are divided by the group size, which is typically 128.
The intermediate size for the FFN blocks in Qwen2.5-72B is 29568.
Unfortunately, {math}`29568 \div 128 = 231`, which is an odd number that is not divisible by any tensor parallel size greater than 1.
Since the number of attention heads and the dimensions of the weights must be divisible by the tensor parallel size, this means you can only run the quantized model with `tensor_parallel_size=1`, i.e., on one GPU card.
A workaround is to make the intermediate size divisible by {math}`128 \times 8 = 1024`.
To achieve that, the weights should be padded with zeros.
While it is mathematically equivalent before and after zero-padding the weights, the results may be slightly different in reality.
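A rough sketch of the zero-padding, assuming a target size of 29696 (that is, {math}`29568 + 128`, which is divisible by 1024) and the usual gate/up/down projection layout; double-check against the actual model structure before quantizing:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

src = "Qwen/Qwen2.5-72B-Instruct"      # original checkpoint
dst = "Qwen2.5-72B-Instruct-padded"    # output directory (placeholder)

model = AutoModelForCausalLM.from_pretrained(src, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(src)

old_size = model.config.intermediate_size   # 29568
new_size = 29696                            # divisible by 128 * 8 = 1024
pad = new_size - old_size

for layer in model.model.layers:
    mlp = layer.mlp
    # gate_proj / up_proj weights have shape [intermediate_size, hidden_size]: pad output rows.
    for name in ("gate_proj", "up_proj"):
        w = getattr(mlp, name).weight.data
        getattr(mlp, name).weight.data = torch.cat([w, w.new_zeros(pad, w.shape[1])], dim=0)
    # down_proj weight has shape [hidden_size, intermediate_size]: pad input columns.
    w = mlp.down_proj.weight.data
    mlp.down_proj.weight.data = torch.cat([w, w.new_zeros(w.shape[0], pad)], dim=1)

model.config.intermediate_size = new_size
model.save_pretrained(dst)
tokenizer.save_pretrained(dst)
```
:::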
Quantization is a major topic for local inference of LLMs, as it reduces the memory footprint.
Undoubtedly, llama.cpp natively supports LLM quantization and, of course, with its usual flexibility.
At a high level, all quantization supported by llama.cpp is weight quantization:
Model parameters are quantized into lower bits, and in inference, they are dequantized and used in computation.
In addition, you can mix different quantization data types in a single quantized model, e.g., you can quantize the embedding weights using a quantization data type and other weights using a different one.
With an adequate mixture of quantization types, much lower quantization error can be attained with just a slight increase of bit-per-weight.
The example program `llama-quantize` supports many quantization presets, such as Q4_K_M and Q8_0.
If you find the quantization errors still more than expected, you can bring your own scales, e.g., as computed by AWQ, or use calibration data to compute an importance matrix using `llama-imatrix`, which can then be used during quantization to enhance the quality of the quantized models.
In this document, we demonstrate the common way to quantize your model and evaluate the performance of the quantized model.
We will assume you have the example programs from llama.cpp at your hand.
If you don't, check our guide [here](../run_locally/llama.cpp.html#getting-the-program){.external}.
## Getting the GGUF
Now, suppose you would like to quantize `Qwen3-8B`.
You need to first make a GGUF file as shown below:
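A minimal sketch of the conversion and quantization steps, assuming you have the llama.cpp example programs and a local copy of the checkpoint (script and file names may differ slightly across llama.cpp versions):
```bash
# Convert the Hugging Face checkpoint (local directory) to a GGUF file
python convert-hf-to-gguf.py ./Qwen3-8B/ --outfile qwen3-8b-f16.gguf

# Quantize the GGUF file with a preset, e.g., Q4_K_M
./llama-quantize qwen3-8b-f16.gguf qwen3-8b-q4_k_m.gguf Q4_K_M
```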
You can find all the presets in [the source code of `llama-quantize`](https://github.com/ggml-org/llama.cpp/blob/master/examples/quantize/quantize.cpp).
Look for the variable `QUANT_OPTIONS`.
Common ones used for 7B models include `Q8_0`, `Q5_0`, and `Q4_K_M`.
The letter case doesn't matter, so `q8_0` or `q4_K_m` are perfectly fine.
Now you can use the GGUF file of the quantized model with applications based on llama.cpp.
Very simple indeed.
However, the accuracy of the quantized model could occasionally be lower than expected, especially for lower-bit quantization.
The program may even prevent you from doing so.
There are several ways to improve quality of quantized models.
A common way is to use a calibration dataset in the target domain to identify the weights that really matter and quantize the model in a way that those weights have lower quantization errors, as introduced in the next two methods.
## Quantizing the GGUF with AWQ Scale
:::{attention}
To be updated for Qwen3.
:::
To improve the quality of your quantized models, one possible solution is to apply the AWQ scale, following [this script](https://github.com/casper-hansen/AutoAWQ/blob/main/docs/examples.md#gguf-export).
First, when you run `model.quantize()` with `autoawq`, remember to add `export_compatible=True` as shown below:
```python
...
model.quantize(
tokenizer,
quant_config=quant_config,
export_compatible=True
)
model.save_pretrained(quant_path)
...
```
The above code will not actually quantize the weights.
Instead, it adjusts weights based on a dataset so that they are "easier" to quantize.[^AWQ]
Then, when you run `convert-hf-to-gguf.py`, remember to replace the model path with the path to the new model:
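For example (paths are placeholders; the input directory is the one saved by `model.save_pretrained(quant_path)` above):
```bash
python convert-hf-to-gguf.py ./model-awq-scaled/ --outfile model-awq-f16.gguf
```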
In this way, it should be possible to achieve similar quality with lower bit-per-weight.
[^AWQ]: If you are interested in what this means, refer to [the AWQ paper](https://arxiv.org/abs/2306.00978).
Basically, important weights (called salient weights in the paper) are identified based on activations across data examples.
The weights are scaled accordingly such that the salient weights are protected even after quantization.
## Quantizing the GGUF with Importance Matrix
Another possible solution is to use the "importance matrix"[^imatrix], following [this](https://github.com/ggml-org/llama.cpp/tree/master/examples/imatrix).
First, you need to compute the importance matrix data of the weights of a model (`-m`) using a calibration dataset (`-f`):
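A minimal sketch of the two steps (file names are illustrative):
```bash
# Compute the importance matrix from a calibration text file
./llama-imatrix -m qwen3-8b-f16.gguf -f calibration.txt -o imatrix.dat

# Use it when quantizing to a low-bit mixture
./llama-quantize --imatrix imatrix.dat qwen3-8b-f16.gguf qwen3-8b-q2_k.gguf Q2_K
```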
For lower-bit quantization mixtures, such as 1-bit or 2-bit ones, `llama-quantize` will print a helpful warning if you do not provide `--imatrix`.
[^imatrix]: Here, the importance matrix keeps a record of how weights affect the output: a weight should be important if a slight change in its value causes a huge difference in the results, akin to the [GPTQ](https://arxiv.org/abs/2210.17323) algorithm.
## Perplexity Evaluation
`llama.cpp` provides an example program for us to calculate the perplexity, which evaluates how unlikely the given text is to the model.
It should be mostly used for comparisons: the lower the perplexity, the better the model remembers the given text.
To do this, you need to prepare a dataset, say "wiki test"[^wiki].
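A minimal sketch of the command, assuming a quantized GGUF file and the raw text file of the dataset (file names are illustrative):
```bash
./llama-perplexity -m qwen3-8b-q4_k_m.gguf -f wiki.test.raw
```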
Wait for some time and you will get the perplexity of the model.
There are some numbers of different kinds of quantization mixture [here](https://github.com/ggml-org/llama.cpp/blob/master/examples/perplexity/README.md).
It might be helpful to look at the difference and grab a sense of how that kind of quantization might perform.
[^wiki]: It is not a good evaluation dataset for instruct models, but it is very common and easily accessible.
You probably want to use a dataset similar to your target domain.
## Finally
In this guide, we demonstrate how to conduct quantization and evaluate the perplexity with llama.cpp.
For more information, please visit the [llama.cpp GitHub repo](https://github.com/ggml-org/llama.cpp).
We usually quantize the fp16 model to 4-, 5-, 6-, and 8-bit models with different quantization mixtures, but sometimes a particular mixture just does not work, so we don't provide those on the Hugging Face Hub.
However, others in the community may have success, so if you haven't found what you need in our repos, look around.