# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
import sys
from sphinx.ext import autodoc
import logging
logger = logging.getLogger(__name__)
# -- Project information -----------------------------------------------------
project = "Qwen"
copyright = "2024, Qwen Team"
author = "Qwen Team"
# -- General configuration ---------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
"sphinx.ext.napoleon",
"sphinx.ext.viewcode",
"sphinx.ext.intersphinx",
# "sphinx_copybutton",
"sphinx.ext.autodoc",
"sphinx.ext.autosummary",
"myst_parser",
"sphinx_design",
]
myst_enable_extensions = ["colon_fence", "attrs_block", "attrs_inline", "fieldlist"]
# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = []
# Exclude the prompt "$" when copying code
copybutton_prompt_text = r"\$ "
copybutton_prompt_is_regexp = True
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_title = project
html_theme = "furo"
# html_logo = 'assets/logo/qwen.png'
# html_theme_options = {
# 'path_to_docs': 'docs/source',
# 'repository_url': 'https://github.com/QwenLM/Qwen2',
# # 'use_repository_button': True,
# }
html_sidebars = {
"**": [
"sidebar/scroll-start.html",
"sidebar/brand.html",
"sidebar/navigation.html",
"sidebar/ethical-ads.html",
"sidebar/scroll-end.html",
]
}
# multi-language docs
language = "en"
locale_dirs = ["../locales/"] # path is example but recommended.
gettext_compact = False # optional.
gettext_uuid = True # optional.
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ["_static"]
html_css_files = [
"css/custom.css",
]
# FIXME: figure out why this file is not copied
html_js_files = [
"design-tabs.js",
]
# Mock out external dependencies here.
autodoc_mock_imports = ["torch", "transformers"]
for mock_target in autodoc_mock_imports:
if mock_target in sys.modules:
logger.info(
f"Potentially problematic mock target ({mock_target}) found; "
"autodoc_mock_imports cannot mock modules that have already "
"been loaded into sys.modules when the sphinx build starts."
)
class MockedClassDocumenter(autodoc.ClassDocumenter):
"""Remove note about base class when a class is derived from object."""
def add_line(self, line: str, source: str, *lineno: int) -> None:
if line == " Bases: :py:class:`object`":
return
super().add_line(line, source, *lineno)
autodoc.ClassDocumenter = MockedClassDocumenter
navigation_with_keys = False
OpenLLM
=======
.. attention::
To be updated for Qwen3.
OpenLLM allows developers to run Qwen2.5 models of different sizes as OpenAI-compatible APIs with a single command. It features a built-in chat UI, state-of-the-art inference backends, and a simplified workflow for creating enterprise-grade cloud deployment with Qwen2.5. Visit `the OpenLLM repository <https://github.com/bentoml/OpenLLM/>`_ to learn more.
Installation
------------
Install OpenLLM using ``pip``.
.. code:: bash
pip install openllm
Verify the installation and display the help information:
.. code:: bash
openllm --help
Quickstart
----------
Before you run any Qwen2.5 model, ensure your model repository is up to date by syncing it with OpenLLM's latest official repository.
.. code:: bash
openllm repo update
List the supported Qwen2.5 models:
.. code:: bash
openllm model list --tag qwen2.5
The results also display the required GPU resources and supported platforms:
.. code:: bash
model version repo required GPU RAM platforms
------- --------------------- ------- ------------------ -----------
qwen2.5 qwen2.5:0.5b default 12G linux
qwen2.5:1.5b default 12G linux
qwen2.5:3b default 12G linux
qwen2.5:7b default 24G linux
qwen2.5:14b default 80G linux
qwen2.5:14b-ggml-q4 default macos
qwen2.5:14b-ggml-q8 default macos
qwen2.5:32b default 80G linux
qwen2.5:32b-ggml-fp16 default macos
qwen2.5:72b default 80Gx2 linux
qwen2.5:72b-ggml-q4 default macos
To start a server with one of the models, use ``openllm serve`` like this:
.. code:: bash
openllm serve qwen2.5:7b
By default, the server starts at ``http://localhost:3000/``.
Interact with the model server
------------------------------
With the model server up and running, you can call its APIs in the following ways:
.. tab-set::
.. tab-item:: CURL
Send an HTTP request to its ``/generate`` endpoint via CURL:
.. code-block:: bash
curl -X 'POST' \
'http://localhost:3000/api/generate' \
-H 'accept: text/event-stream' \
-H 'Content-Type: application/json' \
-d '{
"prompt": "Tell me something about large language models.",
"model": "Qwen/Qwen2.5-7B-Instruct",
"max_tokens": 2048,
"stop": null
}'
.. tab-item:: Python client
Call the OpenAI-compatible endpoints with frameworks and tools that support the OpenAI API protocol. Here is an example:
.. code-block:: python
from openai import OpenAI
client = OpenAI(base_url='http://localhost:3000/v1', api_key='na')
# Use the following func to get the available models
# model_list = client.models.list()
# print(model_list)
chat_completion = client.chat.completions.create(
model="Qwen/Qwen2.5-7B-Instruct",
messages=[
{
"role": "user",
"content": "Tell me something about large language models."
}
],
stream=True,
)
for chunk in chat_completion:
print(chunk.choices[0].delta.content or "", end="")
.. tab-item:: Chat UI
OpenLLM provides a chat UI for the LLM server at the ``/chat`` endpoint, i.e. http://localhost:3000/chat.
.. image:: ../../source/assets/qwen-openllm-ui-demo.png
Model repository
----------------
A model repository in OpenLLM represents a catalog of available LLMs. You can add your own repository to OpenLLM with custom Qwen2.5 variants for your specific needs. See our `documentation to learn details <https://github.com/bentoml/OpenLLM?tab=readme-ov-file#model-repository>`_.
# SGLang
[SGLang](https://github.com/sgl-project/sglang) is a fast serving framework for large language models and vision language models.
To learn more about SGLang, please refer to the [documentation](https://docs.sglang.ai/).
## Environment Setup
By default, you can install `sglang` with pip in a clean environment:
```shell
pip install "sglang[all]>=0.4.6"
```
Please note that `sglang` relies on `flashinfer-python` and has strict dependencies on `torch` and its CUDA versions.
Check the note in the official document for installation ([link](https://docs.sglang.ai/start/install.html)) for more help.
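If installation fails or the server crashes at startup, it is often a `torch`/CUDA mismatch. Here is a quick, minimal check of the environment before installing (nothing here is SGLang-specific):
```python
# Minimal environment check: confirm the installed torch build and its CUDA
# version match what the prebuilt sglang/flashinfer wheels expect.
import torch

print("torch version:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("GPU available:", torch.cuda.is_available())
```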
## API Service
It is easy to build an OpenAI-compatible API service with SGLang, which can be deployed as a server that implements the OpenAI API protocol.
By default, it starts the server at `http://localhost:30000`.
You can specify the address with `--host` and `--port` arguments.
Run the command as shown below:
```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-8B
```
By default, if the `--model-path` does not point to a valid local directory, it will download the model files from the HuggingFace Hub.
To download the model from ModelScope instead, set the following environment variable before running the above command:
```shell
export SGLANG_USE_MODELSCOPE=true
```
For distributed inference with tensor parallelism, it is as simple as
```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-8B --tensor-parallel-size 4
```
The above command will use tensor parallelism on 4 GPUs.
You should change the number of GPUs according to your demand.
### Basic Usage
Then, you can use the [create chat interface](https://platform.openai.com/docs/api-reference/chat/completions/create) to communicate with Qwen:
::::{tab-set}
:::{tab-item} curl
```shell
curl http://localhost:30000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen3-8B",
"messages": [
{"role": "user", "content": "Give me a short introduction to large language models."}
],
"temperature": 0.6,
"top_p": 0.95,
"top_k": 20,
"max_tokens": 32768
}'
```
:::
:::{tab-item} Python
You can use the API client with the `openai` Python SDK as shown below:
```python
from openai import OpenAI
# Set OpenAI's API key and API base to use SGLang's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:30000/v1"
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
chat_response = client.chat.completions.create(
model="Qwen/Qwen3-8B",
messages=[
{"role": "user", "content": "Give me a short introduction to large language models."},
],
temperature=0.6,
top_p=0.95,
top_k=20,
max_tokens=32768,
)
print("Chat response:", chat_response)
```
::::
:::{tip}
While the default sampling parameters would work most of the time for thinking mode,
it is recommended to adjust the sampling parameters according to your application,
and always pass the sampling parameters to the API.
:::
### Thinking & Non-Thinking Modes
:::{important}
This feature has not been released.
For more information, please see this [pull request](https://github.com/sgl-project/sglang/pull/5551).
:::
Qwen3 models think before responding.
This behaviour can be controlled either by the hard switch, which disables thinking completely, or by the soft switch, where the model follows the user's instruction on whether or not it should think.
The hard switch is available in SGLang through the following configuration of the API call.
To disable thinking, use
::::{tab-set}
:::{tab-item} curl
```shell
curl http://localhost:30000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen3-8B",
"messages": [
{"role": "user", "content": "Give me a short introduction to large language models."}
],
"temperature": 0.7,
"top_p": 0.8,
"top_k": 20,
"max_tokens": 8192,
"presence_penalty": 1.5,
"chat_template_kwargs": {"enable_thinking": false}
}'
```
:::
:::{tab-item} Python
You can use the API client with the `openai` Python SDK as shown below:
```python
from openai import OpenAI
# Set OpenAI's API key and API base to use SGLang's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:30000/v1"
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
chat_response = client.chat.completions.create(
model="Qwen/Qwen3-8B",
messages=[
{"role": "user", "content": "Give me a short introduction to large language models."},
],
temperature=0.7,
top_p=0.8,
top_k=20,
presence_penalty=1.5,
extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print("Chat response:", chat_response)
```
::::
:::{tip}
It is recommended to set sampling parameters differently for thinking and non-thinking modes.
:::
### Parsing Thinking Content
SGLang supports parsing the thinking content from the model generation into structured messages:
```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-8B --reasoning-parser deepseek-r1
```
The response message will have a field named `reasoning_content` in addition to `content`, containing the thinking content generated by the model.
:::{note}
Please note that this feature is not OpenAI API compatible.
:::
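For example, with the `openai` Python SDK the extra field can still be read from the parsed response; a minimal sketch (assuming the server above was started with the reasoning parser enabled):
```python
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:30000/v1")

response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
)
message = response.choices[0].message
# reasoning_content is a non-standard field, so access it defensively.
print("thinking:", getattr(message, "reasoning_content", None))
print("answer:", message.content)
```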
### Parsing Tool Calls
SGLang supports parsing the tool calling content from the model generation into structured messages:
```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-8B --tool-call-parser qwen25
```
For more information, please refer to [our guide on Function Calling](../framework/function_call.md).
### Structured/JSON Output
SGLang supports structured/JSON output.
Please refer to [SGLang's documentation](https://docs.sglang.ai/backend/structured_outputs.html#OpenAI-Compatible-API).
Besides, it is also recommended to instruct the model to generate the specific format in the system message or in your prompt.
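As a sketch, the OpenAI-compatible endpoint accepts a JSON schema through `response_format`; the exact field names follow the linked SGLang documentation and may differ between versions:
```python
import json
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:30000/v1")

# Schema the model output must conform to.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["name", "population"],
}

response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "Give me the name and population of the capital of France in JSON."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "capital_info", "schema": schema},
    },
)
print(json.loads(response.choices[0].message.content))
```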
### Serving Quantized models
Qwen3 comes with two types of pre-quantized models, FP8 and AWQ.
The commands for serving these models are the same as for the original models, except for the change in model name:
```shell
# For FP8 quantized model
python -m sglang.launch_server --model-path Qwen/Qwen3-8B-FP8
# For AWQ quantized model
python -m sglang.launch_server --model-path Qwen/Qwen3-8B-AWQ
```
### Context Length
The context length for Qwen3 models in pretraining is up to 32,768 tokens.
To handle context length substantially exceeding 32,768 tokens, RoPE scaling techniques should be applied.
We have validated the performance of [YaRN](https://arxiv.org/abs/2309.00071), a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts.
SGLang supports YaRN, which can be configured as
```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-8B --json-model-override-args '{"rope_scaling":{"type":"yarn","factor":4.0,"original_max_position_embeddings":32768}}'
```
:::{note}
SGLang implements static YaRN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts.**
We advise adding the `rope_scaling` configuration only when processing long contexts is required.
It is also recommended to modify the `factor` as needed. For example, if the typical context length for your application is 65,536 tokens, it would be better to set `factor` as 2.0.
:::
:::{note}
The default `max_position_embeddings` in `config.json` is set to 40,960, which is used by SGLang.
This allocation includes reserving 32,768 tokens for outputs and 8,192 tokens for typical prompts, which is sufficient for most scenarios involving short text processing and leaves adequate room for model thinking.
If the average context length does not exceed 32,768 tokens, we do not recommend enabling YaRN in this scenario, as it may potentially degrade model performance.
:::
SkyPilot
========
.. attention::
To be updated for Qwen3.
What is SkyPilot
----------------
SkyPilot is a framework for running LLMs, AI, and batch jobs on any
cloud, offering maximum cost savings, the highest GPU availability, and
managed execution. Its features include:
- Get the best GPU availability by utilizing multiple resource pools across multiple regions and clouds.
- Pay the absolute minimum: SkyPilot picks the cheapest resources across regions and clouds. No managed solution markups.
- Scale up to multiple replicas across different locations and accelerators, all served with a single endpoint.
- Everything stays in your cloud account (your VMs and buckets).
- Completely private: no one else sees your chat history.
Install SkyPilot
----------------
We advise you to follow the
`instruction <https://skypilot.readthedocs.io/en/latest/getting-started/installation.html>`__
to install SkyPilot. Here we provide a simple example of using ``pip``
for the installation as shown below.
.. code:: bash
# You can use any of the following clouds that you have access to:
# aws, gcp, azure, oci, lambda, runpod, fluidstack, paperspace,
# cudo, ibm, scp, vsphere, kubernetes
pip install "skypilot-nightly[aws,gcp]"
After that, you need to verify cloud access with a command like:
.. code:: bash
sky check
For more information, check the `official document <https://skypilot.readthedocs.io/en/latest/getting-started/installation.html>`__ and see if you have
set up your cloud accounts correctly.
Alternatively, you can also use the official docker image with SkyPilot
master branch automatically cloned by running:
.. code:: bash
# NOTE: '--platform linux/amd64' is needed for Apple Silicon Macs
docker run --platform linux/amd64 \
-td --rm --name sky \
-v "$HOME/.sky:/root/.sky:rw" \
-v "$HOME/.aws:/root/.aws:rw" \
-v "$HOME/.config/gcloud:/root/.config/gcloud:rw" \
berkeleyskypilot/skypilot-nightly
docker exec -it sky /bin/bash
Running Qwen2.5-72B-Instruct with SkyPilot
------------------------------------------
1. Start serving Qwen2.5-72B-Instruct on a single instance with any
available GPU in the list specified in
`serve-72b.yaml <https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/serve-72b.yaml>`__
with a vLLM-powered OpenAI-compatible endpoint:
.. code:: bash
sky launch -c qwen serve-72b.yaml
**Before launching, make sure you have changed Qwen/Qwen2-72B-Instruct to Qwen/Qwen2.5-72B-Instruct in the YAML file.**
2. Send a request to the endpoint for completion:
.. code:: bash
IP=$(sky status --ip qwen)
curl -L http://$IP:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-72B-Instruct",
"prompt": "My favorite food is",
"max_tokens": 512
}' | jq -r '.choices[0].text'
3. Send a request for chat completion:
.. code:: bash
curl -L http://$IP:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-72B-Instruct",
"messages": [
{
"role": "system",
"content": "You are Qwen, created by Alibaba Cloud. You are a helpful and honest chat expert."
},
{
"role": "user",
"content": "What is the best food?"
}
],
"max_tokens": 512
}' | jq -r '.choices[0].message.content'
Scale up the service with SkyPilot Serve
----------------------------------------
1. With `SkyPilot
Serve <https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html>`__,
a serving library built on top of SkyPilot, scaling up the Qwen
service is as simple as running:
.. code:: bash
sky serve up -n qwen ./serve-72b.yaml
**Before launching, make sure you have changed Qwen/Qwen2-72B-Instruct to Qwen/Qwen2.5-72B-Instruct in the YAML file.**
This will start the service with multiple replicas on the cheapest
available locations and accelerators. SkyServe will automatically manage
the replicas, monitor their health, autoscale based on load, and restart
them when needed.
A single endpoint will be returned and any request sent to the endpoint
will be routed to the ready replicas.
2. To check the status of the service, run:
.. code:: bash
sky serve status qwen
After a while, you will see the following output:
::
Services
NAME VERSION UPTIME STATUS REPLICAS ENDPOINT
Qwen 1 - READY 2/2 3.85.107.228:30002
Service Replicas
SERVICE_NAME ID VERSION IP LAUNCHED RESOURCES STATUS REGION
Qwen 1 1 - 2 mins ago 1x Azure({'A100-80GB': 8}) READY eastus
Qwen 2 1 - 2 mins ago 1x GCP({'L4': 8}) READY us-east4-a
As shown, the service is now backed by 2 replicas, one on Azure and one
on GCP, and the accelerator type is chosen to be **the cheapest
available one** on the clouds. That is, it maximizes the availability
of the service while minimizing the cost.
3. To access the model, we use a ``curl -L`` command (``-L`` to follow
redirect) to send the request to the endpoint:
.. code:: bash
ENDPOINT=$(sky serve status --endpoint qwen)
curl -L http://$ENDPOINT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-72B-Instruct",
"messages": [
{
"role": "system",
"content": "You are Qwen, created by Alibaba Cloud. You are a helpful and honest code assistant expert in Python."
},
{
"role": "user",
"content": "Show me the python code for quick sorting a list of integers."
}
],
"max_tokens": 512
}' | jq -r '.choices[0].message.content'
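The same endpoint also works with the ``openai`` Python SDK; the following is a minimal sketch, assuming the ``ENDPOINT`` environment variable holds the value returned by ``sky serve status --endpoint qwen``:
.. code:: python

    import os

    from openai import OpenAI

    # ENDPOINT looks like "3.85.107.228:30002" (no scheme), as returned by SkyServe.
    endpoint = os.environ["ENDPOINT"]
    client = OpenAI(base_url=f"http://{endpoint}/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model="Qwen/Qwen2.5-72B-Instruct",
        messages=[
            {"role": "user", "content": "Show me the python code for quick sorting a list of integers."},
        ],
        max_tokens=512,
    )
    print(response.choices[0].message.content)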
Accessing Qwen2.5 with Chat GUI
---------------------------------------------
It is also possible to access the Qwen2.5 service with a GUI by connecting a
`FastChat GUI server <https://github.com/lm-sys/FastChat>`__ to the endpoint launched
above (see `gui.yaml <https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/gui.yaml>`__).
1. Start the Chat Web UI:
.. code:: bash
sky launch -c qwen-gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint qwen)
**Before launching, make sure you have changed Qwen/Qwen1.5-72B-Chat to Qwen/Qwen2.5-72B-Instruct in the YAML file.**
2. Then, we can access the GUI at the returned gradio link:
::
| INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live
Note that you may get better results by using a different temperature and top_p value.
Summary
-------
With SkyPilot, it is easy for you to deploy Qwen2.5 on any cloud. We
advise you to read the official documentation for more usage details and updates.
Check `this <https://skypilot.readthedocs.io/>`__ out!
TGI
=====================
.. attention::
To be updated for Qwen3.
Hugging Face's Text Generation Inference (TGI) is a production-ready framework specifically designed for deploying and serving large language models (LLMs) for text generation tasks. It offers a seamless deployment experience, powered by a robust set of features:
* `Speculative Decoding <Speculative Decoding_>`_: Accelerates generation speeds.
* `Tensor Parallelism`_: Enables efficient deployment across multiple GPUs.
* `Token Streaming`_: Allows for the continuous generation of text.
* Versatile Device Support: Works seamlessly with `AMD`_, `Gaudi`_ and `AWS Inferentia`_.
.. _AMD: https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/deploy-your-model.html#serving-using-hugging-face-tgi
.. _Gaudi: https://github.com/huggingface/tgi-gaudi
.. _AWS Inferentia: https://aws.amazon.com/blogs/machine-learning/announcing-the-launch-of-new-hugging-face-llm-inference-containers-on-amazon-sagemaker/#:~:text=Get%20started%20with%20TGI%20on%20SageMaker%20Hosting
.. _Tensor Parallelism: https://huggingface.co/docs/text-generation-inference/conceptual/tensor_parallelism
.. _Token Streaming: https://huggingface.co/docs/text-generation-inference/conceptual/streaming
Installation
-----------------
The easiest way to use TGI is via the TGI docker image. In this guide, we show how to use TGI with docker.
It's possible to run it locally via Conda or build locally. Please refer to `Installation Guide <https://huggingface.co/docs/text-generation-inference/installation>`_ and `CLI tool <https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/using_cli>`_ for detailed instructions.
Deploy Qwen2.5 with TGI
-----------------------
1. **Find a Qwen2.5 Model:** Choose a model from `the Qwen2.5 collection <https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e>`_.
2. **Deployment Command:** Run the following command in your terminal, replacing ``model`` with your chosen Qwen2.5 model ID and ``volume`` with the path to your local data directory:
.. code:: bash
model=Qwen/Qwen2.5-7B-Instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model
Using TGI API
-------------
Once deployed, the model will be available on the mapped port (8080).
TGI comes with a handy API for streaming responses:
.. code:: bash
curl http://localhost:8080/generate_stream -H 'Content-Type: application/json' \
-d '{"inputs":"Tell me something about large language models.","parameters":{"max_new_tokens":512}}'
An OpenAI-style API is also available:
.. code:: bash
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "",
"messages": [
{"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
{"role": "user", "content": "Tell me something about large language models."}
],
"temperature": 0.7,
"top_p": 0.8,
"repetition_penalty": 1.05,
"max_tokens": 512
}'
.. note::
The ``model`` field in the JSON is not used by TGI; you can put anything there.
Refer to the `TGI Swagger UI <https://huggingface.github.io/text-generation-inference/#/Text%20Generation%20Inference/completions>`_ for a complete API reference.
You can also use Python API:
.. code:: python
from openai import OpenAI
# initialize the client but point it to TGI
client = OpenAI(
base_url="http://localhost:8080/v1/", # replace with your endpoint url
api_key="", # this field is not used when running locally
)
chat_completion = client.chat.completions.create(
model="", # it is not used by TGI, you can put anything
messages=[
{"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
{"role": "user", "content": "Tell me something about large language models."},
],
stream=True,
temperature=0.7,
top_p=0.8,
max_tokens=512,
)
# iterate and print stream
for message in chat_completion:
print(message.choices[0].delta.content, end="")
Quantization for Performance
----------------------------
1. Data-dependent quantization (GPTQ and AWQ)
Both GPTQ and AWQ models are data-dependent. The official quantized models can be found in `the Qwen2.5 collection`_, and you can also quantize models with your own dataset to make them perform better on your use case.
The following shows the command to start TGI with Qwen2.5-7B-Instruct-GPTQ-Int4:
.. code:: bash
model=Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model --quantize gptq
If the model is quantized with AWQ, e.g. Qwen/Qwen2.5-7B-Instruct-AWQ, please use ``--quantize awq``.
2. Data-agnostic quantization
EETQ, on the other hand, is not data-dependent and can be used with any model. Note that we're passing in the original model (instead of a quantized model) with the ``--quantize eetq`` flag.
.. code:: bash
model=Qwen/Qwen2.5-7B-Instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model --quantize eetq
Multi-Accelerators Deployment
-----------------------------
Use the ``--num-shard`` flag to specify the number of accelerators. Please also use ``--shm-size 1g`` to enable shared memory for optimal NCCL performance (`reference <https://github.com/huggingface/text-generation-inference?tab=readme-ov-file#a-note-on-shared-memory-shm>`__):
.. code:: bash
model=Qwen/Qwen2.5-7B-Instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model --num-shard 2
Speculative Decoding
--------------------
Speculative decoding can reduce the time per token by speculating on the next token. Use the ``--speculate`` flag, setting the value to the number of tokens to speculate on (default: 0 for no speculation):
.. code:: bash
model=Qwen/Qwen2.5-7B-Instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model --speculate 2
The overall performance of speculative decoding highly depends on the type of task. It works best for code or highly repetitive text.
More context on speculative decoding can be found `here <https://huggingface.co/docs/text-generation-inference/conceptual/speculation>`__.
Zero-Code Deployment with HF Inference Endpoints
---------------------------------------------------
For effortless deployment, leverage Hugging Face Inference Endpoints:
- **GUI interface:** `<https://huggingface.co/inference-endpoints/dedicated>`__
- **Coding interface:** `<https://huggingface.co/blog/tgi-messages-api>`__
Once deployed, the endpoint can be used as usual.
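For example, the OpenAI-style client shown earlier works the same way against a dedicated endpoint; a minimal sketch with a placeholder URL and token:
.. code:: python

    from openai import OpenAI

    # Placeholders: replace with your Inference Endpoint URL and a Hugging Face access token.
    client = OpenAI(
        base_url="https://your-endpoint.endpoints.huggingface.cloud/v1/",
        api_key="hf_xxx",
    )
    chat_completion = client.chat.completions.create(
        model="tgi",  # not used by TGI, any value works
        messages=[
            {"role": "user", "content": "Tell me something about large language models."},
        ],
        max_tokens=512,
    )
    print(chat_completion.choices[0].message.content)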
Common Issues
----------------
Qwen2.5 supports long context lengths, so carefully choose the values for ``--max-batch-prefill-tokens``, ``--max-total-tokens``, and ``--max-input-tokens`` to avoid potential out-of-memory (OOM) issues. If an OOM occurs, you'll receive an error message upon startup. The following shows an example to modify those parameters:
.. code:: bash
model=Qwen/Qwen2.5-7B-Instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model --max-batch-prefill-tokens 4096 --max-total-tokens 4096 --max-input-tokens 2048
# vLLM
We recommend trying [vLLM](https://github.com/vllm-project/vllm) for your deployment of Qwen.
It is simple to use and fast, with state-of-the-art serving throughput, efficient management of attention key-value memory with PagedAttention, continuous batching of input requests, optimized CUDA kernels, and more.
To learn more about vLLM, please refer to the [paper](https://arxiv.org/abs/2309.06180) and [documentation](https://docs.vllm.ai/).
## Environment Setup
By default, you can install `vllm` with pip in a clean environment:
```shell
pip install "vllm>=0.8.4"
```
Please note that the prebuilt `vllm` has strict dependencies on `torch` and its CUDA versions.
Check the note in the official document for installation ([link](https://docs.vllm.ai/en/latest/getting_started/installation.html)) for more help.
## API Service
It is easy to build an OpenAI-compatible API service with vLLM, which can be deployed as a server that implements the OpenAI API protocol.
By default, it starts the server at `http://localhost:8000`.
You can specify the address with `--host` and `--port` arguments.
Run the command as shown below:
```shell
vllm serve Qwen/Qwen3-8B
```
By default, if the model does not point to a valid local directory, it will download the model files from the HuggingFace Hub.
To download the model from ModelScope instead, set the following environment variable before running the above command:
```shell
export VLLM_USE_MODELSCOPE=true
```
For distributed inference with tensor parallelism, it is as simple as
```shell
vllm serve Qwen/Qwen3-8B --tensor-parallel-size 4
```
The above command will use tensor parallelism on 4 GPUs.
You should change the number of GPUs according to your demand.
### Basic Usage
Then, you can use the [create chat interface](https://platform.openai.com/docs/api-reference/chat/completions/create) to communicate with Qwen:
::::{tab-set}
:::{tab-item} curl
```shell
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen3-8B",
"messages": [
{"role": "user", "content": "Give me a short introduction to large language models."}
],
"temperature": 0.6,
"top_p": 0.95,
"top_k": 20,
"max_tokens": 32768
}'
```
:::
:::{tab-item} Python
You can use the API client with the `openai` Python SDK as shown below:
```python
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
chat_response = client.chat.completions.create(
model="Qwen/Qwen3-8B",
messages=[
{"role": "user", "content": "Give me a short introduction to large language models."},
],
temperature=0.6,
top_p=0.95,
top_k=20,
max_tokens=32768,
)
print("Chat response:", chat_response)
```
::::
:::{tip}
`vllm` will use the sampling parameters from the `generation_config.json` in the model files.
While the default sampling parameters would work most of the time for thinking mode,
it is recommended to adjust the sampling parameters according to your application,
and always pass the sampling parameters to the API.
:::
### Thinking & Non-Thinking Modes
Qwen3 models think before responding.
This behaviour can be controlled either by the hard switch, which disables thinking completely, or by the soft switch, where the model follows the user's instruction on whether or not it should think.
The hard switch is available in vLLM through the following configuration of the API call.
To disable thinking, use
::::{tab-set}
:::{tab-item} curl
```shell
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen3-8B",
"messages": [
{"role": "user", "content": "Give me a short introduction to large language models."}
],
"temperature": 0.7,
"top_p": 0.8,
"top_k": 20,
"max_tokens": 8192,
"presence_penalty": 1.5,
"chat_template_kwargs": {"enable_thinking": false}
}'
```
:::
:::{tab-item} Python
You can use the API client with the `openai` Python SDK as shown below:
```python
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
chat_response = client.chat.completions.create(
model="Qwen/Qwen3-8B",
messages=[
{"role": "user", "content": "Give me a short introduction to large language models."},
],
temperature=0.7,
top_p=0.8,
top_k=20,
presence_penalty=1.5,
extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print("Chat response:", chat_response)
```
::::
:::{tip}
It is recommended to set sampling parameters differently for thinking and non-thinking modes.
:::
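The soft switch mentioned above is exercised in the conversation itself: when thinking is enabled, Qwen3 follows `/think` and `/no_think` directives in the user message. A minimal sketch (the directive placement is an illustration; see the Qwen3 model cards for details):
```python
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Soft switch: append /no_think (or /think) to the user turn to suggest that the
# model skip (or perform) thinking for this turn, while enable_thinking stays True.
chat_response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language models. /no_think"},
    ],
    temperature=0.7,
    top_p=0.8,
)
print(chat_response.choices[0].message.content)
```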
### Parsing Thinking Content
vLLM supports parsing the thinking content from the model generation into structured messages:
```shell
vllm serve Qwen/Qwen3-8B --enable-reasoning --reasoning-parser deepseek_r1
```
The response message will have a field named `reasoning_content` in addition to `content`, containing the thinking content generated by the model.
:::{note}
Please note that this feature is not OpenAI API compatible.
:::
### Parsing Tool Calls
vLLM supports parsing the tool calling content from the model generation into structured messages:
```shell
vllm serve Qwen/Qwen3-8B --enable-auto-tool-choice --tool-call-parser hermes
```
For more information, please refer to [our guide on Function Calling](../framework/function_call.md#vllm).
:::{note}
As of vLLM 0.5.4, parsing the thinking content and the tool calls from the model generation at the same time is not supported.
:::
### Structured/JSON Output
vLLM supports structured/JSON output.
Please refer to [vLLM's documentation](https://docs.vllm.ai/en/stable/serving/openai_compatible_server.html#extra-parameters-for-chat-api) for the `guided_json` parameters.
Besides, it is also recommended to instruct the model to generate the specific format in the system message or in your prompt.
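As a sketch, the schema can be passed through `extra_body` using the `guided_json` parameter mentioned above:
```python
import json
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Schema the model output must conform to.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["name", "population"],
}

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[
        {"role": "system", "content": "Answer only with a JSON object."},
        {"role": "user", "content": "Give me the name and population of the capital of France."},
    ],
    extra_body={"guided_json": schema},
)
print(json.loads(chat_response.choices[0].message.content))
```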
### Serving Quantized models
Qwen3 comes with two types of pre-quantized models, FP8 and AWQ.
The commands for serving these models are the same as for the original models, except for the change in model name:
```shell
# For FP8 quantized model
vllm serve Qwen/Qwen3-8B-FP8
# For AWQ quantized model
vllm serve Qwen/Qwen3-8B-AWQ
```
:::{note}
FP8 computation is supported on NVIDIA GPUs with compute capability > 8.9, that is, Ada Lovelace, Hopper, and later GPUs.
FP8 models will run on compute capability > 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin.
:::
:::{important}
As of vLLM 0.5.4, there are compatibility issues between `vllm` and the Qwen3 FP8 checkpoints.
For a quick fix, you should make the following changes to the file `vllm/vllm/model_executor/layers/linear.py`:
```python
# these changes are in QKVParallelLinear.weight_loader_v2() of vllm/vllm/model_executor/layers/linear.py
...
shard_offset = self._get_shard_offset_mapping(loaded_shard_id)
shard_size = self._get_shard_size_mapping(loaded_shard_id)
# add the following code
if isinstance(param, BlockQuantScaleParameter):
weight_block_size = self.quant_method.quant_config.weight_block_size
block_n, _ = weight_block_size[0], weight_block_size[1]
shard_offset = (shard_offset + block_n - 1) // block_n
shard_size = (shard_size + block_n - 1) // block_n
# end of the modification
param.load_qkv_weight(loaded_weight=loaded_weight,
num_heads=self.num_kv_head_replicas,
shard_id=loaded_shard_id,
shard_offset=shard_offset,
shard_size=shard_size)
...
```
:::
### Context Length
The context length for Qwen3 models in pretraining is up to 32,768 tokens.
To handle context length substantially exceeding 32,768 tokens, RoPE scaling techniques should be applied.
We have validated the performance of [YaRN](https://arxiv.org/abs/2309.00071), a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts.
vLLM supports YaRN, which can be configured as
```shell
vllm serve Qwen/Qwen3-8B --rope-scaling '{"type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072
```
:::{note}
vLLM implements static YaRN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts.**
We advise adding the `rope_scaling` configuration only when processing long contexts is required.
It is also recommended to modify the `factor` as needed. For example, if the typical context length for your application is 65,536 tokens, it would be better to set `factor` as 2.0.
:::
:::{note}
The default `max_position_embeddings` in `config.json` is set to 40,960, which is used by vLLM if `--max-model-len` is not specified.
This allocation includes reserving 32,768 tokens for outputs and 8,192 tokens for typical prompts, which is sufficient for most scenarios involving short text processing and leaves adequate room for model thinking.
If the average context length does not exceed 32,768 tokens, we do not recommend enabling YaRN in this scenario, as it may potentially degrade model performance.
:::
## Python Library
vLLM can also be used directly as a Python library, which is convenient for offline batch inference but lacks some API-only features, such as parsing the model generation into structured messages.
The following shows the basic usage of vLLM as a library:
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
# Configure the sampling parameters (for thinking mode)
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, max_tokens=32768)
# Initialize the vLLM engine
llm = LLM(model="Qwen/Qwen3-8B")
# Prepare the input to the model
prompt = "Give me a short introduction to large language models."
messages = [
{"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
# Generate outputs
outputs = llm.generate([text], sampling_params)
# Print the outputs.
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
## FAQ
You may encounter OOM issues that can be rather annoying.
Two arguments can help you work around them, as shown in the sketch after this list.
- The first one is `--max-model-len`.
Our provided default `max_position_embeddings` is `40960`, so the maximum serving length also defaults to this value, which leads to higher memory requirements.
Reducing it to a length appropriate for your use case often helps with OOM issues.
- Another argument to pay attention to is `--gpu-memory-utilization`.
vLLM pre-allocates this fraction of GPU memory; by default, it is `0.9`.
This is also why a vLLM service always appears to take so much memory.
If you are in eager mode (not the default), you can increase it to tackle OOM problems.
Otherwise, CUDA graphs are used, which consume GPU memory not controlled by vLLM, and you should try lowering the value.
If that does not work, you can try `--enforce-eager`, which may slow down inference, or reduce `--max-model-len`.
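The same knobs are available when using vLLM as a Python library; a minimal sketch with deliberately conservative settings (adjust to your hardware):
```python
from vllm import LLM, SamplingParams

# Conservative settings for constrained GPU memory: a shorter maximum length
# and a lower fraction of pre-allocated GPU memory.
llm = LLM(
    model="Qwen/Qwen3-8B",
    max_model_len=8192,
    gpu_memory_utilization=0.8,
    enforce_eager=True,  # disable CUDA graphs, trading some speed for memory
)

sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=512)
outputs = llm.generate(["Give me a short introduction to large language models."], sampling_params)
print(outputs[0].outputs[0].text)
```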
Langchain
==========================
.. attention::
To be updated for Qwen3.
This guide helps you build a question-answering application based
on a local knowledge base using ``Qwen2.5-7B-Instruct`` with ``langchain``.
The goal is to establish a knowledge base Q&A solution.
Basic Usage
-----------
The implementation process of this project includes
loading files -> reading text -> segmenting text -> vectorizing text -> vectorizing questions
-> matching the top k most similar text vectors with the question vectors ->
incorporating the matched text as context along with the question into the prompt ->
submitting to the Qwen2.5-7B-Instruct to generate an answer.
Below is an example:
.. code:: bash
pip install langchain==0.0.174
pip install faiss-gpu
.. code:: python
from transformers import AutoModelForCausalLM, AutoTokenizer
from abc import ABC
from langchain.llms.base import LLM
from typing import Any, List, Mapping, Optional
from langchain.callbacks.manager import CallbackManagerForLLMRun
model_name = "Qwen/Qwen2.5-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
class Qwen(LLM, ABC):
max_token: int = 10000
temperature: float = 0.01
top_p = 0.9
history_len: int = 3
def __init__(self):
super().__init__()
@property
def _llm_type(self) -> str:
return "Qwen"
@property
def _history_len(self) -> int:
return self.history_len
def set_history_len(self, history_len: int = 10) -> None:
self.history_len = history_len
def _call(
self,
prompt: str,
stop: Optional[List[str]] = None,
run_manager: Optional[CallbackManagerForLLMRun] = None,
) -> str:
messages = [
{"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
{"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
**model_inputs,
max_new_tokens=512
)
generated_ids = [
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
return response
@property
def _identifying_params(self) -> Mapping[str, Any]:
"""Get the identifying parameters."""
return {"max_token": self.max_token,
"temperature": self.temperature,
"top_p": self.top_p,
"history_len": self.history_len}
After loading the Qwen2.5-7B-Instruct model, you should specify the txt file
for retrieval.
.. code:: python
import os
import re
import torch
import argparse
from langchain.vectorstores import FAISS
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from typing import List, Tuple
import numpy as np
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.docstore.document import Document
from langchain.prompts.prompt import PromptTemplate
from langchain.chains import RetrievalQA
class ChineseTextSplitter(CharacterTextSplitter):
def __init__(self, pdf: bool = False, **kwargs):
super().__init__(**kwargs)
self.pdf = pdf
def split_text(self, text: str) -> List[str]:
if self.pdf:
text = re.sub(r"\n{3,}", "\n", text)
text = re.sub(r'\s', ' ', text)
text = text.replace("\n\n", "")
sent_sep_pattern = re.compile(
'([﹒﹔﹖﹗.。!?]["’”」』]{0,2}|(?=["‘“「『]{1,2}|$))')
sent_list = []
for ele in sent_sep_pattern.split(text):
if sent_sep_pattern.match(ele) and sent_list:
sent_list[-1] += ele
elif ele:
sent_list.append(ele)
return sent_list
def load_file(filepath):
loader = TextLoader(filepath, autodetect_encoding=True)
textsplitter = ChineseTextSplitter(pdf=False)
docs = loader.load_and_split(textsplitter)
write_check_file(filepath, docs)
return docs
def write_check_file(filepath, docs):
folder_path = os.path.join(os.path.dirname(filepath), "tmp_files")
if not os.path.exists(folder_path):
os.makedirs(folder_path)
fp = os.path.join(folder_path, 'load_file.txt')
with open(fp, 'a+', encoding='utf-8') as fout:
fout.write("filepath=%s,len=%s" % (filepath, len(docs)))
fout.write('\n')
for i in docs:
fout.write(str(i))
fout.write('\n')
fout.close()
def separate_list(ls: List[int]) -> List[List[int]]:
lists = []
ls1 = [ls[0]]
for i in range(1, len(ls)):
if ls[i - 1] + 1 == ls[i]:
ls1.append(ls[i])
else:
lists.append(ls1)
ls1 = [ls[i]]
lists.append(ls1)
return lists
class FAISSWrapper(FAISS):
chunk_size = 250
chunk_conent = True
score_threshold = 0
def similarity_search_with_score_by_vector(
self, embedding: List[float], k: int = 4
) -> List[Tuple[Document, float]]:
scores, indices = self.index.search(np.array([embedding], dtype=np.float32), k)
docs = []
id_set = set()
store_len = len(self.index_to_docstore_id)
for j, i in enumerate(indices[0]):
if i == -1 or 0 < self.score_threshold < scores[0][j]:
# This happens when not enough docs are returned.
continue
_id = self.index_to_docstore_id[i]
doc = self.docstore.search(_id)
if not self.chunk_conent:
if not isinstance(doc, Document):
raise ValueError(f"Could not find document for id {_id}, got {doc}")
doc.metadata["score"] = int(scores[0][j])
docs.append(doc)
continue
id_set.add(i)
docs_len = len(doc.page_content)
for k in range(1, max(i, store_len - i)):
break_flag = False
for l in [i + k, i - k]:
if 0 <= l < len(self.index_to_docstore_id):
_id0 = self.index_to_docstore_id[l]
doc0 = self.docstore.search(_id0)
if docs_len + len(doc0.page_content) > self.chunk_size:
break_flag = True
break
elif doc0.metadata["source"] == doc.metadata["source"]:
docs_len += len(doc0.page_content)
id_set.add(l)
if break_flag:
break
if not self.chunk_conent:
return docs
if len(id_set) == 0 and self.score_threshold > 0:
return []
id_list = sorted(list(id_set))
id_lists = separate_list(id_list)
for id_seq in id_lists:
for id in id_seq:
if id == id_seq[0]:
_id = self.index_to_docstore_id[id]
doc = self.docstore.search(_id)
else:
_id0 = self.index_to_docstore_id[id]
doc0 = self.docstore.search(_id0)
doc.page_content += " " + doc0.page_content
if not isinstance(doc, Document):
raise ValueError(f"Could not find document for id {_id}, got {doc}")
doc_score = min([scores[0][id] for id in [indices[0].tolist().index(i) for i in id_seq if i in indices[0]]])
doc.metadata["score"] = int(doc_score)
docs.append((doc, doc_score))
return docs
if __name__ == '__main__':
# load docs (pdf file or txt file)
filepath = 'your file path'
# Embedding model name
EMBEDDING_MODEL = 'text2vec'
PROMPT_TEMPLATE = """Known information:
{context_str}
Based on the above known information, respond to the user's question concisely and professionally. If an answer cannot be derived from it, say 'The question cannot be answered with the given information' or 'Not enough relevant information has been provided,' and do not include fabricated details in the answer. Please respond in English. The question is {question}"""
# Embedding running device
EMBEDDING_DEVICE = "cuda"
# return top-k text chunk from vector store
VECTOR_SEARCH_TOP_K = 3
CHAIN_TYPE = 'stuff'
embedding_model_dict = {
"text2vec": "your text2vec model path",
}
llm = Qwen()
embeddings = HuggingFaceEmbeddings(model_name=embedding_model_dict[EMBEDDING_MODEL],model_kwargs={'device': EMBEDDING_DEVICE})
docs = load_file(filepath)
docsearch = FAISSWrapper.from_documents(docs, embeddings)
prompt = PromptTemplate(
template=PROMPT_TEMPLATE, input_variables=["context_str", "question"]
)
chain_type_kwargs = {"prompt": prompt, "document_variable_name": "context_str"}
qa = RetrievalQA.from_chain_type(
llm=llm,
chain_type=CHAIN_TYPE,
retriever=docsearch.as_retriever(search_kwargs={"k": VECTOR_SEARCH_TOP_K}),
chain_type_kwargs=chain_type_kwargs)
query = "Give me a short introduction to large language model."
print(qa.run(query))
Next Step
---------
Now you can chat with Qwen2.5 using your own documents. Continue
to read the documentation and try to figure out more advanced usage of
retrieval with the model!
LlamaIndex
==========
.. attention::
To be updated for Qwen3.
To connect Qwen2.5 with external data, such as documents, web pages, etc., we offer a tutorial on `LlamaIndex <https://www.llamaindex.ai/>`__.
This guide helps you quickly implement retrieval-augmented generation (RAG) using LlamaIndex with Qwen2.5.
Preparation
--------------------------------------
To implement RAG,
we advise you to install the LlamaIndex-related packages first.
The following is a simple code snippet showing how to do this:
.. code:: bash
pip install llama-index
pip install llama-index-llms-huggingface
pip install llama-index-readers-web
Set Parameters
--------------------------------------
Now we can set up LLM, embedding model, and the related configurations.
Qwen2.5-Instruct supports conversations in multiple languages, including English and Chinese.
You can use the ``bge-base-en-v1.5`` model to retrieve from English documents, and you can download the ``bge-base-zh-v1.5`` model to retrieve from Chinese documents.
You can also choose ``bge-large`` or ``bge-small`` as the embedding model or modify the context window size or text chunk size depending on your computing resources.
The Qwen2.5 model family supports a context window of up to 32K tokens (up to 128K for the 7B, 14B, 32B, and 72B models, which requires extra configuration).
.. code:: python
import torch
from llama_index.core import Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
# Set prompt template for generation (optional)
from llama_index.core import PromptTemplate
def completion_to_prompt(completion):
return f"<|im_start|>system\n<|im_end|>\n<|im_start|>user\n{completion}<|im_end|>\n<|im_start|>assistant\n"
def messages_to_prompt(messages):
prompt = ""
for message in messages:
if message.role == "system":
prompt += f"<|im_start|>system\n{message.content}<|im_end|>\n"
elif message.role == "user":
prompt += f"<|im_start|>user\n{message.content}<|im_end|>\n"
elif message.role == "assistant":
prompt += f"<|im_start|>assistant\n{message.content}<|im_end|>\n"
if not prompt.startswith("<|im_start|>system"):
prompt = "<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n" + prompt
prompt = prompt + "<|im_start|>assistant\n"
return prompt
# Set Qwen2.5 as the language model and set generation config
Settings.llm = HuggingFaceLLM(
model_name="Qwen/Qwen2.5-7B-Instruct",
tokenizer_name="Qwen/Qwen2.5-7B-Instruct",
context_window=30000,
max_new_tokens=2000,
generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
messages_to_prompt=messages_to_prompt,
completion_to_prompt=completion_to_prompt,
device_map="auto",
)
# Set embedding model
Settings.embed_model = HuggingFaceEmbedding(
model_name = "BAAI/bge-base-en-v1.5"
)
# Set the size of the text chunk for retrieval
Settings.transformations = [SentenceSplitter(chunk_size=1024)]
Build Index
--------------------------------------
Now we can build index from documents or websites.
The following code snippet demonstrates how to build an index for files (regardless of whether they are in PDF or TXT format) in a local folder named 'document'.
.. code:: python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
documents = SimpleDirectoryReader("./document").load_data()
index = VectorStoreIndex.from_documents(
documents,
embed_model=Settings.embed_model,
transformations=Settings.transformations
)
The following code snippet demonstrates how to build an index for the content in a list of websites.
.. code:: python
from llama_index.readers.web import SimpleWebPageReader
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
documents = SimpleWebPageReader(html_to_text=True).load_data(
["web_address_1","web_address_2",...]
)
index = VectorStoreIndex.from_documents(
documents,
embed_model=Settings.embed_model,
transformations=Settings.transformations
)
To save and load the index, you can use the following code snippet.
.. code:: python
from llama_index.core import StorageContext, load_index_from_storage
# save index to disk
index.storage_context.persist(persist_dir="save")
# load index from disk
storage_context = StorageContext.from_defaults(persist_dir="save")
index = load_index_from_storage(storage_context)
RAG
-------------------
Now you can perform queries, and Qwen2.5 will answer based on the content of the indexed documents.
.. code:: python
query_engine = index.as_query_engine()
your_query = "<your query here>"
print(query_engine.query(your_query).response)
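For long answers, you can also stream the response as it is generated; a minimal sketch using LlamaIndex's streaming query engine:
.. code:: python

    streaming_query_engine = index.as_query_engine(streaming=True)
    streaming_response = streaming_query_engine.query("<your query here>")
    streaming_response.print_response_stream()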
---
myst:
number_code_blocks: ["python3"]
---
# Function Calling
:::{attention}
To be updated for Qwen3.
Since the support for tool calling in Qwen3 is a superset of that in Qwen2, the examples would still work.
:::
## Preface
Function calling with large language models is a huge and evolving topic.
It is particularly important for AI applications:
- either for AI-native applications that strive to work around the shortcomings of current AI technology,
- or for existing applications that seek the integration of AI technology to improve performance, user interaction and experience, or efficiency.
This guide will not delve into those discussions or which role an LLM should play in an application and the related best practice.
Those views are reflected in the design of AI application frameworks: from LangChain to LlamaIndex to QwenAgent.
Instead, we will talk about how Qwen2.5 can be used to support function calling and how it can be used to achieve your goals, from the inference usage for developing applications to the inner workings for hardcore customization.
In this guide,
- We will first demonstrate how to use function calling with Qwen2.5.
- Then, we will introduce the technical details on function calling with Qwen2.5, which are mainly about the templates.
Before starting, there is one thing we have not yet introduced, that is ...
## What is function calling?
:::{Note}
There is another term "tool use" that may be used to refer to the same concept.
While some may argue that tools are a generalized form of functions, at present, their difference exists only technically as different I/O types of programming interfaces.
:::
Large language models (LLMs) are powerful things.
However, sometimes LLMs by themselves are simply not capable enough.
- On the one hand, LLMs have inherent modeling limitations.
For one, they do not know things that are not in their training data, including anything that happened after their training ended.
In addition, they learn in terms of likelihood, which means they may not be precise enough for tasks with fixed rule sets, e.g., mathematical computation.
- On the other hand, it is not easy to use LLMs as a Plug-and-Play service programmatically with other things.
LLMs mostly talk in words that are open to interpretation and thus ambiguous, while other software or applications or systems talk in code and through programming interfaces that are pre-defined and fixed and structured.
To this end, function calling establishes a common protocol that specifies how LLMs should interact with the other things.
The procedure is mainly as follows:
1. The application provides a set of functions and the instructions of the functions to an LLM.
2. The LLM chooses to use one or more of the functions, chooses not to, or is forced to do so, in response to user queries.
3. If the LLM chooses to use the functions, it states how the functions should be used based on the function instructions.
4. The chosen functions are used as such by the application and the results are obtained, which are then given to the LLM if further interaction is needed.
There are many ways for LLMs to understand and follow this protocol.
As always, the key is prompt engineering or an internalized template known by the model.
Qwen2.5 models were pre-trained with various types of templates that support function calling, so that users can directly make use of this procedure.
## Inference with Function Calling
:::{note}
Please be aware that the inference usage is subject to change as the frameworks and the Qwen models evolve.
:::
As function calling is essentially implemented using prompt engineering, you could manually construct the model inputs for Qwen2.5 models.
However, frameworks with function calling support can help you with all that laborious work.
In the following, we will introduce the usage (via dedicated function calling chat template) with
- **Qwen-Agent**,
- **Hugging Face transformers**,
- **Ollama**, and
- **vLLM**.
If you are familiar with the usage of OpenAI API, you could also directly use the OpenAI-compatible API services for Qwen2.5.
However, not all of them support function calling for Qwen2.5.
Currently, supported solutions include the self-hosted service by [Ollama](https://github.com/ollama/ollama/blob/main/docs/openai.md) or [vLLM](https://docs.vllm.ai/en/stable/serving/openai_compatible_server.html#tool-calling-in-the-chat-completion-api) and the cloud service of [ModelStudio \[zh\]](https://help.aliyun.com/zh/model-studio/developer-reference/compatibility-of-openai-with-dashscope#97e2b45391x08).
If you are familiar with application frameworks, e.g., LangChain, you can also use function calling abilities in Qwen2.5 via ReAct Prompting.
### The Example Case
Let's also use an example to demonstrate the inference usage.
We assume **Python 3.11** is used as the programming language.
**Scenario**: Suppose we would like to ask the model about the temperature of a location.
Normally, the model would reply that it cannot provide real-time information.
But we have two tools that can obtain the current temperature of a city and the temperature of a city on a given date, respectively, and we would like the model to make use of them.
To set up the example case, you can use the following code:
:::{dropdown} Preparation Code
:name: prepcode
```python
import json
def get_current_temperature(location: str, unit: str = "celsius"):
"""Get current temperature at a location.
Args:
location: The location to get the temperature for, in the format "City, State, Country".
unit: The unit to return the temperature in. Defaults to "celsius". (choices: ["celsius", "fahrenheit"])
Returns:
the temperature, the location, and the unit in a dict
"""
return {
"temperature": 26.1,
"location": location,
"unit": unit,
}
def get_temperature_date(location: str, date: str, unit: str = "celsius"):
"""Get temperature at a location and date.
Args:
location: The location to get the temperature for, in the format "City, State, Country".
date: The date to get the temperature for, in the format "Year-Month-Day".
unit: The unit to return the temperature in. Defaults to "celsius". (choices: ["celsius", "fahrenheit"])
Returns:
the temperature, the location, the date and the unit in a dict
"""
return {
"temperature": 25.9,
"location": location,
"date": date,
"unit": unit,
}
def get_function_by_name(name):
if name == "get_current_temperature":
return get_current_temperature
if name == "get_temperature_date":
return get_temperature_date
TOOLS = [
{
"type": "function",
"function": {
"name": "get_current_temperature",
"description": "Get current temperature at a location.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": 'The location to get the temperature for, in the format "City, State, Country".',
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": 'The unit to return the temperature in. Defaults to "celsius".',
},
},
"required": ["location"],
},
},
},
{
"type": "function",
"function": {
"name": "get_temperature_date",
"description": "Get temperature at a location and date.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": 'The location to get the temperature for, in the format "City, State, Country".',
},
"date": {
"type": "string",
"description": 'The date to get the temperature for, in the format "Year-Month-Day".',
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": 'The unit to return the temperature in. Defaults to "celsius".',
},
},
"required": ["location", "date"],
},
},
},
]
MESSAGES = [
{"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant.\n\nCurrent Date: 2024-09-30"},
{"role": "user", "content": "What's the temperature in San Francisco now? How about tomorrow?"},
]
```
:::
In particular, the tools should be described using JSON Schema and the messages should contain as much available information as possible.
You can find the explanations of the tools and messages below:
:::{dropdown} Example Tools
The tools should be described using the following JSON:
```json
[
{
"type": "function",
"function": {
"name": "get_current_temperature",
"description": "Get current temperature at a location.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The location to get the temperature for, in the format \"City, State, Country\"."
},
"unit": {
"type": "string",
"enum": [
"celsius",
"fahrenheit"
],
"description": "The unit to return the temperature in. Defaults to \"celsius\"."
}
},
"required": [
"location"
]
}
}
},
{
"type": "function",
"function": {
"name": "get_temperature_date",
"description": "Get temperature at a location and date.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The location to get the temperature for, in the format \"City, State, Country\"."
},
"date": {
"type": "string",
"description": "The date to get the temperature for, in the format \"Year-Month-Day\"."
},
"unit": {
"type": "string",
"enum": [
"celsius",
"fahrenheit"
],
"description": "The unit to return the temperature in. Defaults to \"celsius\"."
}
},
"required": [
"location",
"date"
]
}
}
}
]
```
Each **tool** is a JSON object with two fields:
- `type`: a string specifying the type of the tool, currently only `"function"` is valid
- `function`: an object detailing the instructions to use the function
Each **function** is a JSON object with three fields:
- `name`: a string indicating the name of the function
- `description`: a string describing what the function is used for
- `parameters`: [a JSON Schema](https://json-schema.org/learn/getting-started-step-by-step) that specifies the parameters the function accepts. Please refer to the linked documentation for how to compose a JSON Schema. Notable fields include `type`, `required`, and `enum`.
Most frameworks use the tool format, while some may use the function format.
Which one to use should be clear from the parameter naming.
:::
:::{dropdown} Example Messages
Our query is `What's the temperature in San Francisco now? How about tomorrow?`.
Since the model does not know what the current date is, let alone tomorrow, we should provide the date in the inputs.
Here, we decide to supply that information in the system message after the default system message `You are Qwen, created by Alibaba Cloud. You are a helpful assistant.`.
You could instead append the date to the user message in your application code.
```json
[
{"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant.\n\nCurrent Date: 2024-09-30"},
{"role": "user", "content": "What's the temperature in San Francisco now? How about tomorrow?"}
]
```
:::
### Qwen-Agent
[Qwen-Agent](https://github.com/QwenLM/Qwen-Agent) is a Python agent framework for developing AI applications.
Although its intended use cases are higher-level than efficient inference, it does contain the **canonical implementation** of function calling for Qwen2.5.
It adds the function calling ability of Qwen2.5 on top of an OpenAI-compatible API through templates, in a way that is transparent to users.
{#note-official-template}
It's worth noting that since application frameworks can do a lot of work under the hood, the current official function calling implementation for Qwen2.5 is very flexible and goes beyond simple templating, making it hard to adapt to other frameworks that use less capable templating engines.
Before starting, let's make sure the latest library is installed:
```bash
pip install -U qwen-agent
```
For this guide, we are at version v0.0.10.
#### Preparing
Qwen-Agent can wrap an OpenAI-compatible API that does not support function calling.
You can serve such an API with most inference frameworks or obtain one from cloud providers like DashScope or Together.
Assuming there is an OpenAI-compatible API at `http://localhost:8000/v1`, Qwen-Agent provides a shortcut function `get_chat_model` to obtain a model inference class with function calling support:
```python
from qwen_agent.llm import get_chat_model
llm = get_chat_model({
"model": "Qwen/Qwen2.5-7B-Instruct",
"model_server": "http://localhost:8000/v1",
"api_key": "EMPTY",
})
```
In the above, `model_server` is the `api_base` commonly used in other OpenAI-compatible API clients.
It is advised to provide the `api_key` (though not as plaintext in the code), even if the API server does not check it, in which case you can set it to anything.
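For example, one way to avoid hard-coding the key is to read it from an environment variable; the variable name `MY_API_KEY` here is only illustrative:

```python
import os

from qwen_agent.llm import get_chat_model

llm = get_chat_model({
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "model_server": "http://localhost:8000/v1",
    # "EMPTY" is a safe fallback for local servers that do not check the key.
    "api_key": os.getenv("MY_API_KEY", "EMPTY"),
})
```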
For model inputs, the common message structure for system, user, and assistant history should be used:
```python
messages = MESSAGES[:]
# [
# {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant.\n\nCurrent Date: 2024-09-30"},
# {"role": "user", "content": "What's the temperature in San Francisco now? How about tomorrow?"}
# ]
```
We add the current date to the system message so that the "tomorrow" in the user message is anchored.
It can also be added to the user message if one desires.
At the time of writing, Qwen-Agent works with functions instead of tools.
This requires a small change to our tool descriptions, that is, extracting the function fields:
```python
functions = [tool["function"] for tool in TOOLS]
```
#### Tool Calls and Tool Results
To interact with the model, the `chat` method should be used:
```python
for responses in llm.chat(
messages=messages,
functions=functions,
extra_generate_cfg=dict(parallel_function_calls=True),
):
pass
messages.extend(responses)
```
In the above code, the `chat` method receives the `messages`, the `functions`, and an `extra_generate_cfg` parameter.
You can put sampling parameters, such as `temperature` and `top_p`, in the `extra_generate_cfg`.
Here, we add to it a special control `parallel_function_calls` provided by Qwen-Agent.
As its name suggests, it will enable parallel function calls, which means that the model may generate multiple function calls for a single turn as it deems fit.
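For instance, a call that also sets sampling parameters might look like the following sketch; the exact set of accepted keys depends on the backing API:

```python
for responses in llm.chat(
    messages=messages,
    functions=functions,
    extra_generate_cfg=dict(
        temperature=0.7,                # sampling parameters are forwarded to the backing API
        top_p=0.8,
        parallel_function_calls=True,   # Qwen-Agent-specific control
    ),
):
    pass
```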
The `chat` method returns a generator of lists, each of which may contain multiple messages.
Since we enable `parallel_function_calls`, we should get two messages in the responses:
```python
[
{'role': 'assistant', 'content': '', 'function_call': {'name': 'get_current_temperature', 'arguments': '{"location": "San Francisco, CA, USA", "unit": "celsius"}'}},
{'role': 'assistant', 'content': '', 'function_call': {'name': 'get_temperature_date', 'arguments': '{"location": "San Francisco, CA, USA", "date": "2024-10-01", "unit": "celsius"}'}},
]
```
As we can see, Qwen-Agent attempts to parse the model generation into an easier-to-use structured format.
The details related to function calls are placed in the `function_call` field of the messages:
- `name`: a string representing the function to call
- `arguments`: a JSON-formatted string representing the arguments the function should be called with
Note that Qwen2.5-7B-Instruct is quite capable:
- It has followed the function instructions to add the state and the country to the location.
- It has correctly inferred tomorrow's date and given it in the format required by the function.
Then comes the critical part -- checking and applying the function call:
```python
for message in responses:
    if fn_call := message.get("function_call", None):
        fn_name: str = fn_call["name"]
        fn_args: dict = json.loads(fn_call["arguments"])

        fn_res: str = json.dumps(get_function_by_name(fn_name)(**fn_args))

        messages.append({
            "role": "function",
            "name": fn_name,
            "content": fn_res,
        })
```
To get tool results:
- line 1: We should iterate over the function calls in the order the model generates them.
- line 2: We can check whether the model deems a function call necessary by checking the `function_call` field of the generated messages.
- lines 3-4: The related details, i.e., the name and the arguments of the function, can also be found there in the `name` and `arguments` fields respectively.
- line 6: With the details, one should call the function and obtain the results.
Here, we assume there is a function named [`get_function_by_name`](#prepcode) to help us get the related function by its name.
- lines 8-12: With the result obtained, add it to the messages as `content` with `role` set to `"function"`.
Now the messages are
```python
[
{'role': 'system', 'content': 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.\n\nCurrent Date: 2024-09-30'},
{'role': 'user', 'content': "What's the temperature in San Francisco now? How about tomorrow?"},
{'role': 'assistant', 'content': '', 'function_call': {'name': 'get_current_temperature', 'arguments': '{"location": "San Francisco, CA, USA", "unit": "celsius"}'}},
{'role': 'assistant', 'content': '', 'function_call': {'name': 'get_temperature_date', 'arguments': '{"location": "San Francisco, CA, USA", "date": "2024-10-01", "unit": "celsius"}'}},
{'role': 'function', 'name': 'get_current_temperature', 'content': '{"temperature": 26.1, "location": "San Francisco, CA, USA", "unit": "celsius"}'},
{'role': 'function', 'name': 'get_temperature_date', 'content': '{"temperature": 25.9, "location": "San Francisco, CA, USA", "date": "2024-10-01", "unit": "celsius"}'},
]
```
#### Final Response
Finally, run the model again to get the final model results:
```python
for responses in llm.chat(messages=messages, functions=functions):
pass
messages.extend(responses)
```
The final response should be like
```python
{'role': 'assistant', 'content': 'Currently, the temperature in San Francisco is approximately 26.1°C. Tomorrow, on 2024-10-01, the temperature is forecasted to be around 25.9°C.'}
```
### Hugging Face transformers
Since function calling is based on prompt engineering and templates, `transformers` supports it with its tokenizer utilities, in particular the `tokenizer.apply_chat_template` method, which uses the Jinja templating engine to hide the complexity of constructing the model inputs.
However, this means that users have to handle the model output on their own, including parsing the generated function call messages.
The blog piece [_Tool Use, Unified_](https://huggingface.co/blog/unified-tool-use) is very helpful in understanding its design.
Be sure to take a look.
The tool use API has been available in `transformers` since v4.42.0.
Before starting, let's check that:
```bash
pip install "transformers>4.42.0"
```
For this guide, we are at version v4.44.2.
#### Preparing
For Qwen2.5, the chat template in `tokenizer_config.json` has already included support for the Hermes-style tool use.
We simply need to load the model and the tokenizer:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name_or_path = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(
model_name_or_path,
torch_dtype="auto",
device_map="auto",
)
```
The inputs are the same as those in [the preparation code](#prepcode):
```python
tools = TOOLS
messages = MESSAGES[:]
```
In `transformers`, you can also directly use Python functions as tools with certain constraints[^get_json_schema_note]:
```python
tools = [get_current_temperature, get_temperature_date]
```
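If you want to check what `transformers` derives from a function before relying on it, you can inspect the generated description with the `get_json_schema` utility mentioned in the footnote; this minimal sketch reuses `get_current_temperature` from the preparation code:

```python
import json

from transformers.utils import get_json_schema

# Print the tool description derived from the type hints and Google-style docstring.
print(json.dumps(get_json_schema(get_current_temperature), indent=2))
```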
[^get_json_schema_note]: `transformers` will use `transformers.utils.get_json_schema` to generate the tool descriptions from Python functions.
There are some gotchas with `get_json_schema`, and it is advised to check [its doc \[v4.44.2\]](https://github.com/huggingface/transformers/blob/v4.44.2/src/transformers/utils/chat_template_utils.py#L183-L288) before relying on it.
- The function should use Python type hints for parameter types and has a Google-style docstring for function description and parameter descriptions.
- Supported types are limited, since the types needs to be mapped to JSON Schema.
In particular, `typing.Literal` is not supported.
You can instead add `(choices: ...)` at the end of a parameter description, which will be mapped to a `enum` type in JSON Schema.
Please be aware that all the returned results in the examples in the linked docstring are actually the content of the `function` field in the actual returned results.
#### Tool Calls and Tool Results
To construct the input sequence, we should use the `apply_chat_template` method and then let the model continue the texts:
```python
text = tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
output_text = tokenizer.batch_decode(outputs)[0][len(text):]
```
The output text should be like
```text
<tool_call>
{"name": "get_current_temperature", "arguments": {"location": "San Francisco, CA, USA"}}
</tool_call>
<tool_call>
{"name": "get_temperature_date", "arguments": {"location": "San Francisco, CA, USA", "date": "2024-10-01"}}
</tool_call><|im_end|>
```
Now we need to do two things:
1. Parse the generated tool calls to a message and add them to the messages, so that the model knows which tools are used.
2. Obtain the results of the tools and add them to the messages, so that the model knows the results of the tool calls.
In `transformers`, the tool calls should be a field of assistant messages.
Let's use a simple function called `try_parse_tool_calls` to parse the tool calls:
{#parse-function}
```python
import re
def try_parse_tool_calls(content: str):
"""Try parse the tool calls."""
tool_calls = []
offset = 0
for i, m in enumerate(re.finditer(r"<tool_call>\n(.+)?\n</tool_call>", content)):
if i == 0:
offset = m.start()
try:
func = json.loads(m.group(1))
tool_calls.append({"type": "function", "function": func})
if isinstance(func["arguments"], str):
func["arguments"] = json.loads(func["arguments"])
except json.JSONDecodeError as e:
print(f"Failed to parse tool calls: the content is {m.group(1)} and {e}")
pass
if tool_calls:
if offset > 0 and content[:offset].strip():
c = content[:offset]
else:
c = ""
return {"role": "assistant", "content": c, "tool_calls": tool_calls}
return {"role": "assistant", "content": re.sub(r"<\|im_end\|>$", "", content)}
```
This function does not cover all possible scenarios and thus is prone to errors.
But it should suffice for the purpose of this guide.
:::{note}
The template in the `tokenizer_config.json` assumes that content generated alongside tool calls is kept in the same assistant message, instead of in separate assistant messages, e.g.,
```json
{
"role": "assistant",
"content": "To obtain the current temperature, I should call the functions `get_current_temperate`.",
"tool_calls": [
{"type": "function", "function": {"name": "get_current_temperature", "arguments": {"location": "San Francisco, CA, USA", "unit": "celsius"}}}
]
}
```
instead of
```json
[
{
"role": "assistant",
"content": "To obtain the current temperature, I should call the functions `get_current_temperate`.",
},
{
"role": "assistant",
"content": "",
"tool_calls": [
{"type": "function", "function": {"name": "get_current_temperature", "arguments": {"location": "San Francisco, CA, USA", "unit": "celsius"}}}
]
}
]
```
This is roughly what `try_parse_tool_calls` implements, but keep it in mind if you are writing your own tool call parser.
:::
```python
messages.append(try_parse_tool_calls(output_text))
if tool_calls := messages[-1].get("tool_calls", None):
for tool_call in tool_calls:
if fn_call := tool_call.get("function"):
fn_name: str = fn_call["name"]
fn_args: dict = fn_call["arguments"]
fn_res: str = json.dumps(get_function_by_name(fn_name)(**fn_args))
messages.append({
"role": "tool",
"name": fn_name,
"content": fn_res,
})
```
The messages now should be like
```python
[
{'role': 'system', 'content': 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.\n\nCurrent Date: 2024-09-30'},
{'role': 'user', 'content': "What's the temperature in San Francisco now? How about tomorrow?"},
{'role': 'assistant', 'content': '', 'tool_calls': [
{'type': 'function', 'function': {'name': 'get_current_temperature', 'arguments': {'location': 'San Francisco, CA, USA'}}},
{'type': 'function', 'function': {'name': 'get_temperature_date', 'arguments': {'location': 'San Francisco, CA, USA', 'date': '2024-10-01'}}},
]},
{'role': 'tool', 'name': 'get_current_temperature', 'content': '{"temperature": 26.1, "location": "San Francisco, CA, USA", "unit": "celsius"}'},
{'role': 'tool', 'name': 'get_temperature_date', 'content': '{"temperature": 25.9, "location": "San Francisco, CA, USA", "date": "2024-10-01", "unit": "celsius"}'},
]
```
The messages are similar to those of Qwen-Agent, but there are some major differences:
- Tools instead of functions.
- Parallel calls are enabled by default.
- Multiple tool calls are kept as a list in a single assistant message, instead of in multiple messages.
- The function arguments are parsed into a dict if the generated arguments form a valid JSON-formatted string.
#### Final Response
Then it's time for the model to generate the actual response for us based on the tool results.
Let's query the model again:
```python
text = tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
output_text = tokenizer.batch_decode(outputs)[0][len(text):]
```
The `output_text` should be like
```text
The current temperature in San Francisco is approximately 26.1°C. Tomorrow, on October 1, 2024, the temperature is expected to be around 25.9°C.<|im_end|>
```
Add the result text as an assistant message and the final messages should be ready for further interaction:
```python
messages.append(try_parse_tool_calls(output_text))
```
### Ollama
Ollama is a set of tools for serving LLMs locally.
It also relies on its template implementation to support function calling.
Different from `transformers`, which is written in Python and uses Jinja templates, whose syntax is heavily inspired by Django and Python, Ollama is mostly written in Go and uses Go's [text/template](https://pkg.go.dev/text/template) package.
In addition, Ollama internally implements a helper function that automatically parses the generated tool calls in text into structured messages, if the format is supported.
You could check the [Tool support](https://ollama.com/blog/tool-support) blog post first.
Tool support has been available in Ollama since v0.3.0.
You can run the following to check the Ollama version:
```bash
ollama -v
```
If lower than expected, follow [the official instructions](https://ollama.com/download) to install the latest version.
In this guide, we will also use [ollama-python](https://github.com/ollama/ollama-python). Before starting, make sure it is available in your environment:
```bash
pip install ollama
```
For this guide, the `ollama` binary is at v0.3.9 and the `ollama` Python library is at v0.3.2.
#### Preparing
The message structure used in Ollama is the same as that in `transformers`, and the template in [Qwen2.5 Ollama models](https://ollama.com/library/qwen2.5) supports tool use.
The inputs are the same as those in [the preparation code](#prepcode):
```python
tools = TOOLS
messages = MESSAGES[:]
model_name = "qwen2.5:7b"
```
Note that you cannot pass Python functions as tools directly; each tool has to be described as a `dict`.
#### Tool Calls and Tool Results
We can use the `ollama.chat` method to directly query the underlying API:
```python
import ollama
response = ollama.chat(
model=model_name,
messages=messages,
tools=tools,
)
```
The main fields in the response could be:
```python
{
'model': 'qwen2.5:7b',
'message': {
'role': 'assistant',
'content': '',
'tool_calls': [
{'function': {'name': 'get_current_temperature', 'arguments': {'location': 'San Francisco, CA, USA'}}},
{'function': {'name': 'get_temperature_date', 'arguments': {'date': '2024-10-01', 'location': 'San Francisco, CA, USA'}}},
],
},
}
```
Ollama's tool call parser has succeeded in parsing the tool calls.
If it does not, you may refine [the `try_parse_tool_calls` function above](#parse-function) and parse the raw output yourself.
Then, we can obtain the tool results and add them to the messages.
The following is basically the same as with `transformers`:
```python
messages.append(response["message"])
if tool_calls := messages[-1].get("tool_calls", None):
for tool_call in tool_calls:
if fn_call := tool_call.get("function"):
fn_name: str = fn_call["name"]
fn_args: dict = fn_call["arguments"]
fn_res: str = json.dumps(get_function_by_name(fn_name)(**fn_args))
messages.append({
"role": "tool",
"name": fn_name,
"content": fn_res,
})
```
The messages are now like
```python
[
{'role': 'system', 'content': 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.\n\nCurrent Date: 2024-09-30'},
{'role': 'user', 'content': "What's the temperature in San Francisco now? How about tomorrow?"},
{'role': 'assistant', 'content': '', 'tool_calls': [
{'function': {'name': 'get_current_temperature', 'arguments': {'location': 'San Francisco, CA, USA'}}},
{'function': {'name': 'get_temperature_date', 'arguments': {'date': '2024-10-01', 'location': 'San Francisco, CA, USA'}}},
]},
{'role': 'tool', 'name': 'get_current_temperature', 'content': '{"temperature": 26.1, "location": "San Francisco, CA, USA", "unit": "celsius"}'},
{'role': 'tool', 'name': 'get_temperature_date', 'content': '{"temperature": 25.9, "location": "San Francisco, CA, USA", "date": "2024-10-01", "unit": "celsius"}'},
]
```
#### Final Response
The rest is easy:
```python
response = ollama.chat(
model=model_name,
messages=messages,
tools=tools,
)
messages.append(response["message"])
```
The final message should be like the following:
```python
{'role': 'assistant', 'content': 'The current temperature in San Francisco is approximately 26.1°C. For tomorrow, October 1st, 2024, the forecasted temperature will be around 25.9°C.'}
```
(heading-target)=
### vLLM
vLLM is a fast and easy-to-use library for LLM inference and serving.
It uses the tokenizer from `transformers` to format the input, so we should have no trouble preparing the input.
In addition, vLLM also implements helper functions so that generated tool calls can be parsed automatically if the format is supported.
Tool support has been available in `vllm` since v0.6.0.
Be sure to install a version that supports tool use.
For more information, check the [vLLM documentation](https://docs.vllm.ai/en/stable/serving/openai_compatible_server.html#tool-calling-in-the-chat-completion-api).
For this guide, we are at version v0.6.1.post2.
We will use the OpenAI-Compatible API by `vllm` with the API client from the `openai` Python library.
#### Preparing
For Qwen2.5, the chat template in `tokenizer_config.json` has already included support for the Hermes-style tool use.
We simply need to start an OpenAI-compatible API with vLLM:
```bash
vllm serve Qwen/Qwen2.5-7B-Instruct --enable-auto-tool-choice --tool-call-parser hermes
```
The inputs are the same with those in [the preparation code](#prepcode):
```python
tools = TOOLS
messages = MESSAGES[:]
```
Let's also initialize the client:
```python
from openai import OpenAI
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
model_name = "Qwen/Qwen2.5-7B-Instruct"
```
#### Tool Calls and Tool Results
We can use the create chat completions endpoint to query the model:
```python
response = client.chat.completions.create(
model=model_name,
messages=messages,
tools=tools,
temperature=0.7,
top_p=0.8,
max_tokens=512,
extra_body={
"repetition_penalty": 1.05,
},
)
```
vLLM should be able to parse the tool calls for us, and the main fields in the response (`response.choices[0]`) should be like
```python
Choice(
finish_reason='tool_calls',
index=0,
logprobs=None,
message=ChatCompletionMessage(
content=None,
role='assistant',
function_call=None,
tool_calls=[
ChatCompletionMessageToolCall(
id='chatcmpl-tool-924d705adb044ff88e0ef3afdd155f15',
function=Function(arguments='{"location": "San Francisco, CA, USA"}', name='get_current_temperature'),
type='function',
),
ChatCompletionMessageToolCall(
id='chatcmpl-tool-7e30313081944b11b6e5ebfd02e8e501',
function=Function(arguments='{"location": "San Francisco, CA, USA", "date": "2024-10-01"}', name='get_temperature_date'),
type='function',
),
],
),
stop_reason=None,
)
```
Note that the function arguments are JSON-formatted strings, which matches Qwen-Agent but differs from `transformers` and Ollama.
As before, there may be corner cases where tool calls are generated but are malformed and cannot be parsed.
For production code, we should also attempt to parse the raw output ourselves.
Then, we can obtain the tool results and add them to the messages as shown below:
```python
messages.append(response.choices[0].message.model_dump())
if tool_calls := messages[-1].get("tool_calls", None):
for tool_call in tool_calls:
call_id: str = tool_call["id"]
if fn_call := tool_call.get("function"):
fn_name: str = fn_call["name"]
fn_args: dict = json.loads(fn_call["arguments"])
fn_res: str = json.dumps(get_function_by_name(fn_name)(**fn_args))
messages.append({
"role": "tool",
"content": fn_res,
"tool_call_id": call_id,
})
```
It should be noted that the OpenAI API uses `tool_call_id` to identify the relation between tool results and tool calls.
The messages are now like
```python
[
{'role': 'system', 'content': 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.\n\nCurrent Date: 2024-09-30'},
{'role': 'user', 'content': "What's the temperature in San Francisco now? How about tomorrow?"},
{'content': None, 'role': 'assistant', 'function_call': None, 'tool_calls': [
{'id': 'chatcmpl-tool-924d705adb044ff88e0ef3afdd155f15', 'function': {'arguments': '{"location": "San Francisco, CA, USA"}', 'name': 'get_current_temperature'}, 'type': 'function'},
{'id': 'chatcmpl-tool-7e30313081944b11b6e5ebfd02e8e501', 'function': {'arguments': '{"location": "San Francisco, CA, USA", "date": "2024-10-01"}', 'name': 'get_temperature_date'}, 'type': 'function'},
]},
{'role': 'tool', 'content': '{"temperature": 26.1, "location": "San Francisco, CA, USA", "unit": "celsius"}', 'tool_call_id': 'chatcmpl-tool-924d705adb044ff88e0ef3afdd155f15'},
{'role': 'tool', 'content': '{"temperature": 25.9, "location": "San Francisco, CA, USA", "date": "2024-10-01", "unit": "celsius"}', 'tool_call_id': 'chatcmpl-tool-7e30313081944b11b6e5ebfd02e8e501'},
]
```
#### Final Response
Let's call the endpoint again to feed the tool results back and get the response:
```python
response = client.chat.completions.create(
model=model_name,
messages=messages,
tools=tools,
temperature=0.7,
top_p=0.8,
max_tokens=512,
extra_body={
"repetition_penalty": 1.05,
},
)
messages.append(response.choices[0].message.model_dump())
```
The final response (`response.choices[0].message.content`) should be like
```text
The current temperature in San Francisco is approximately 26.1°C. For tomorrow, the forecasted temperature is around 25.9°C.
```
### Discussions
Now, we have introduced how to conduct inference with function calling using Qwen2.5 in four different frameworks!
Let's make a brief comparison.
| Item | OpenAI API | Hugging Face transformers | Ollama | vLLM | Qwen-Agent |
| :----- | :---: | :---: | :---: | :---: | :---: |
| Type | HTTP API | Python Library | HTTP API | HTTP API | Python Library |
| Inference Backend | - | PyTorch | llama.cpp | PyTorch | HTTP API |
| Templating Backend | - | Jinja | Go `text/template` | Jinja | Python |
| Tools/Functions | Tools | Tools | Tools | Tools | Functions |
| Parallel Calls | Default Yes (Configurable) | Yes | Yes | Yes | Default No (Configurable) |
| Call Format | Single assistant message with `tool_calls` | Single assistant message with `tool_calls` | Single assistant message with `tool_calls` | Single assistant message with `tool_calls` | Multiple assistant messages with `function_call` |
| Call Argument Format | string | object | object | string | string |
| Call Result Format | Multiple tool messages with `content` | Multiple tool messages with `content` | Multiple tool messages with `content` | Multiple tool messages with `content` | Multiple function messages with `content` |
There are some details not shown in the above table:
- OpenAI API comes with Python, Node.js, Go, and .NET SDKs. It also follows the OpenAPI standard.
- Ollama comes with Python and Node.js SDKs. It also has an OpenAI-compatible API at a different base URL that can be accessed using the OpenAI SDKs.
- Qwen-Agent as an application framework can call the tools automatically for you, which is introduced in [the Qwen-Agent guide](./qwen_agent).
In addition, there is more to the model side of function calling, which means you may need to consider more things in production code:
- **Accuracy of function calling**:
When it comes to evaluating the accuracy of function calling, there are two aspects:
(a) whether the correct functions (including no functions) are selected and
(b) whether the correct function arguments are generated.
It is not always the case that Qwen2.5 will be accurate.
Function calling can involve knowledge that is deep and domain-specific.
Sometimes, it doesn't fully understand the function and selects the wrong one by mistake.
Sometimes, it can fall into a loop and require calling the same function again and again.
Sometimes, it will fabricate required function arguments instead of asking the user for input.
To improve the function calling accuracy, it is advised to first try prompt engineering:
Does a more detailed function description help?
Can we provide instructions and examples to the model in the system message?
If that does not help, fine-tuning on your own data could also improve performance.
- **Protocol consistency**:
Even with the proper function calling template, the protocol may break.
The model may generate extra text alongside the tool calls, e.g., explanations.
The generated tool call may not be a valid JSON-formatted string but rather the representation of a Python dict.
The generated tool call may be valid JSON but not conform to the provided JSON Schema (a sketch of validating generated arguments against the declared schema is given after this list).
While some of those issues can be addressed with prompt engineering, others are caused by the nature of LLMs and can be hard to resolve in a general manner by the LLMs themselves.
While we strive to improve Qwen2.5 in this regard, edge cases are unlikely to be eliminated completely.
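As a sketch of checking conformance, the third-party `jsonschema` package (an assumption here, not something the frameworks above require) can validate generated arguments against the declared schema before the function is actually called:

```python
import json

import jsonschema  # third-party: pip install jsonschema

def validate_tool_call(tool_call: dict, tools: list) -> bool:
    """Return True if the generated arguments conform to the declared JSON Schema."""
    schemas = {t["function"]["name"]: t["function"]["parameters"] for t in tools}
    fn = tool_call["function"]
    args = fn["arguments"]
    if isinstance(args, str):  # some frameworks keep the arguments as a JSON-formatted string
        try:
            args = json.loads(args)
        except json.JSONDecodeError:
            return False
    try:
        jsonschema.validate(instance=args, schema=schemas[fn["name"]])
    except (KeyError, jsonschema.ValidationError):
        return False
    return True
```

A call that fails validation can then be retried, repaired, or surfaced to the user instead of being executed blindly.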
## Function Calling Templates
The template design for function calling often includes the following aspects:
- How to describe the functions to the model, so that the model understands what they are and how to use them.
- How to prompt the model, so that it knows that functions can be used and in what format to generate the function calls.
- How to tell a function call apart from other content in the generated text, so that we can extract the calls and actually make them.
- How to incorporate the function results to the text, so that the model can tell them from its own generation and make connection among the calls and the results.
For experienced prompt engineers, it should be possible to make any LLM support function calling, using in-context learning techniques and with representative examples, though with varied accuracy and stability depending on how "zero-shot" the task at hand is.
### Starting from ReAct Prompting
For example, ReAct Prompting can be used to implement function calling with an extra element of planning:
- **Thought**: the overt reasoning path, analyzing the functions and the user query and saying it out "loud"
- **Action**: the function to use and the arguments with which the function should be called
- **Observation**: the results of the function
In fact, Qwen2 is versed in the following variant of ReAct Prompting (similar to LangChain ReAct), which makes the intermediate texts more structured:
```
Answer the following questions as best you can. You have access to the following tools:
{function_name}: Call this tool to interact with the {function_name_human_readable} API. What is the {function_name_human_readable} API useful for? {function_description} Parameters: {function_parameter_descriptions} {argument_formatting_instructions}
{function_name}: Call this tool to interact with the {function_name_human_readable} API. What is the {function_name_human_readable} API useful for? {function_description} Parameters: {function_parameter_descriptions} {argument_formatting_instructions}
Use the following format:
Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{function_name},{function_name}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can be repeated zero or more times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question
Begin!
Question: {query}
Thought: {some_text}
Action: {function_name}
Action Input: {function_arguments}
Observation: {function_results}
Final Answer: {response}
```
As you can see, there is no apparent user/assistant conversation structure in the template.
The model will simply continue the texts.
One should write the code to actively detect which step the model is at and in particular to add the observations in the process, until the Final Answer is generated.
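As a rough illustration (not the Qwen-Agent implementation), the driving loop could look like the following sketch; `generate` stands for any text-completion call that stops at the given strings, and `functions` maps function names to callables:

```python
import re

def react_loop(prompt: str, functions: dict, generate, max_steps: int = 5) -> str:
    """Generate, detect the current step, execute the action, append the observation, repeat."""
    text = prompt
    for _ in range(max_steps):
        # stop before the model fabricates results; handling of the stop string itself is omitted
        completion = generate(text, stop=["Observation:"])
        text += completion
        if match := re.search(r"Final Answer:(.*)", completion, re.S):
            return match.group(1).strip()
        if match := re.search(r"Action: (.*?)\nAction Input:(.*)", completion, re.S):
            name, args = match.group(1).strip(), match.group(2).strip()
            observation = functions[name](args)  # the application actually calls the function
            text += f"\nObservation: {observation}\nThought: "
    return text
```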
However, as most programming interfaces accept the message structure, there should be some kind of adapter between the two.
[The ReAct Chat Agent](https://github.com/QwenLM/Qwen-Agent/blob/v0.0.10/qwen_agent/agents/react_chat.py) in Qwen-Agent facilitates this kind of conversion.
### Qwen2 Function Calling Template
As a step forward, the official Qwen2 function calling template is in the vein of the ReAct Prompting format but focuses more on
- differentiating the keywords like `Question`, `Thought`, `Action`, etc., from generation,
- simplifying the process,
- supporting better multi-turn conversation, and
- adding controls for specialized usage.
An equivalent example would be
```
<|im_start|>system
{system message}
## Tools
You have access to the following tools:
### {function_name_human_readable}
{function_name}: {function_description} Parameters: {function_parameter_descriptions} {argument_formatting_instructions}
### {function_name_human_readable}
{function_name}: {function_description} Parameters: {function_parameter_descriptions} {argument_formatting_instructions}
## When you need to call a tool, please insert the following command in your reply, which can be called zero or multiple times according to your needs:
✿FUNCTION✿: The tool to use, should be one of [{function_name},{function_name}]
✿ARGS✿: The input of the tool
✿RESULT✿: Tool results
✿RETURN✿: Reply based on tool results. Images need to be rendered as ![](url)<|im_end|>
<|im_start|>user
{query}<|im_end|>
<|im_start|>assistant
✿FUNCTION✿: {function_name}
✿ARGS✿: {function_arguments}
✿RESULT✿: {function_result}
✿RETURN✿:{response}<|im_end|>
```
Let's first list the obvious differences:
- Keywords (`✿FUNCTION✿`, `✿ARGS✿`, etc.) seem rare in ordinary text and are more semantically related to function calling, but they are not special tokens yet.
- Thought is omitted. This could affect accuracy for some use cases.
- The system-user-assistant format is used for multi-turn conversations, and the function calling prompt is moved to the system message.
How about adding controls for specialized usage?
The template actually has the following variants:
- Language: the above is for non-Chinese language; there is another template in Chinese.
- Parallel Calls: the above is for non-parallel calls; there is another template for parallel calls.
In the canonical implementation in Qwen-Agent, those switches are implemented in Python, according to the configuration and current input.
The actual text with _parallel calls_ should be like the following:
```text
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
Current Date: 2024-09-30
## Tools
You have access to the following tools:
### get_current_temperature
get_current_temperature: Get current temperature at a location. Parameters: {"type": "object", "properties": {"location": {"type": "string", "description": "The location to get the temperature for, in the format \"City, State, Country\"."}, "unit": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "The unit to return the temperature in. Defaults to \"celsius\"."}}, "required": ["location"]} Format the arguments as a JSON object.
### get_temperature_date
get_temperature_date: Get temperature at a location and date. Parameters: {"type": "object", "properties": {"location": {"type": "string", "description": "The location to get the temperature for, in the format \"City, State, Country\"."}, "date": {"type": "string", "description": "The date to get the temperature for, in the format \"Year-Month-Day\"."}, "unit": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "The unit to return the temperature in. Defaults to \"celsius\"."}}, "required": ["location", "date"]} Format the arguments as a JSON object.
## Insert the following command in your reply when you need to call N tools in parallel:
✿FUNCTION✿: The name of tool 1, should be one of [get_current_temperature,get_temperature_date]
✿ARGS✿: The input of tool 1
✿FUNCTION✿: The name of tool 2
✿ARGS✿: The input of tool 2
...
✿FUNCTION✿: The name of tool N
✿ARGS✿: The input of tool N
✿RESULT✿: The result of tool 1
✿RESULT✿: The result of tool 2
...
✿RESULT✿: The result of tool N
✿RETURN✿: Reply based on tool results. Images need to be rendered as ![](url)<|im_end|>
<|im_start|>user
What's the temperature in San Francisco now? How about tomorrow?<|im_end|>
<|im_start|>assistant
✿FUNCTION✿: get_current_temperature
✿ARGS✿: {"location": "San Francisco, CA, USA"}
✿FUNCTION✿: get_temperature_date
✿ARGS✿: {"location": "San Francisco, CA, USA", "date": "2024-10-01"}
✿RESULT✿: {"temperature": 26.1, "location": "San Francisco, CA, USA", "unit": "celsius"}
✿RESULT✿: {"temperature": 25.9, "location": "San Francisco, CA, USA", "date": "2024-10-01", "unit": "celsius"}
✿RETURN✿: The current temperature in San Francisco is approximately 26.1°C. For tomorrow, October 1st, 2024, the forecasted temperature will be around 25.9°C.<|im_end|>
```
This template is hard to adapt to other frameworks that use less capable templating engines.
But it is doable at least partially for Jinja, which is Python-oriented after all.
We didn't use it because the template requires more changes to the inference usage in `transformers`, which may not be familiar to beginners.
For the interested, you can find the Jinja template and key points on usage below:
:::{dropdown} Qwen2 Function Calling Jinja Template
```jinja
{%- if messages[0]["role"] == "system" %}
{%- set system_message = messages[0]["content"] %}
{%- set loop_messages = messages[1:] %}
{%- else %}
{%- set system_message = "You are a helpful assistant." %}
{%- set loop_messages = messages %}
{%- endif %}
{%- if parallel_tool_calls is undefined %}
{%- set parallel_tool_calls = false %}
{%- endif %}
{%- if language is undefined or language != "zh" %}
{%- set language = "en" %}
{%- endif %}
{{- "<|im_start|>system\n" + system_message|trim }}
{%- if tools is defined %}
{{- "\n\n# 工具\n\n## 你拥有如下工具:\n\n" if language == "zh" else "\n\n## Tools\n\nYou have access to the following tools:\n\n" }}
{%- set functions = tools|map(attribute="function")|list %}
{%- set function_names = functions|map(attribute="name")|join(",") %}
{%- for function in functions %}
{{- "### " + function.name + "\n\n" + function.name + ": " + function.description + (" 输入参数:" if language == "zh" else " Parameters: ") + function.parameters|tojson + (" 此工具的输入应为JSON对象。\n\n" if language == "zh" else " Format the arguments as a JSON object.\n\n") }}
{%- endfor %}
{%- if parallel_tool_calls and language == "zh" %}
{{- "## 你可以在回复中插入以下命令以并行调用N个工具:\n\n✿FUNCTION✿: 工具1的名称,必须是[" + function_names + "]之一\n✿ARGS✿: 工具1的输入\n✿FUNCTION✿: 工具2的名称\n✿ARGS✿: 工具2的输入\n...\n✿FUNCTION✿: 工具N的名称\n✿ARGS✿: 工具N的输入\n✿RESULT✿: 工具1的结果\n✿RESULT✿: 工具2的结果\n...\n✿RESULT✿: 工具N的结果\n✿RETURN✿: 根据工具结 果进行回复,需将图片用![](url)渲染出来" }}
{%- elif parallel_tool_calls %}
{{- "## Insert the following command in your reply when you need to call N tools in parallel:\n\n✿FUNCTION✿: The name of tool 1, should be one of [" + function_names + "]\n✿ARGS✿: The input of tool 1\n✿FUNCTION✿: The name of tool 2\n✿ARGS✿: The input of tool 2\n...\n✿FUNCTION✿: The name of tool N\n✿ARGS✿: The input of tool N\n✿RESULT✿: The result of tool 1\n✿RESULT✿: The result of tool 2\n...\n✿RESULT✿: The result of tool N\n✿RETURN✿: Reply based on tool results. Images need to be rendered as ![](url)" }}
{%- elif language == "zh" %}
{{- "## 你可以在回复中插入零次、一次或多次以下命令以调用工具:\n\n✿FUNCTION✿: 工具名称,必须是[" + function_names + "]之一。\n✿ARGS✿: 工具输入\n✿RESULT✿: 工具结果\n✿RETURN✿: 根据工具结果进行回复,需将图片用![](url)渲染出来" }}
{%- else %}
{{- "## When you need to call a tool, please insert the following command in your reply, which can be called zero or multiple times according to your needs:\n\n✿FUNCTION✿: The tool to use, should be one of [" + function_names + "]\n✿ARGS✿: The input of the tool\n✿RESULT✿: Tool results\n✿RETURN✿: Reply based on tool results. Images need to be rendered as ![](url)" }}
{%- endif %}
{%- endif %}
{{- "<|im_end|>" }}
{%- for message in loop_messages %}
{%- if message.role == "user" %}
{{- "\n<|im_start|>" + message.role + "\n" + message.content + "<|im_end|>" }}
{%- if loop.last and add_generation_prompt %}
{{- "\n<|im_start|>assistant\n" }}
{%- endif %}
{%- elif message.role == "tool" %}
{{- "✿RESULT✿: " + message.content + "\n" }}
{%- if loop.last and add_generation_prompt %}
{{- "✿RETURN✿:" }}
{%- endif %}
{%- elif message.role == "assistant" and message.tool_calls is defined %}
{%- if loop.previtem.role == "user" %}
{{- "\n<|im_start|>assistant\n" }}
{%- endif %}
{%- for function in message.tool_calls|map(attribute="function") %}
{{- "✿FUNCTION✿: " + function.name + "\n✿ARGS✿: " + function.arguments|tojson + "\n" }}
{%- endfor %}
{%- elif message.role == "assistant" %}
{%- if loop.previtem.role == "user" %}
{{- "\n<|im_start|>assistant\n" }}
{%- elif loop.previtem.role == "tool" %}
{{- "✿RETURN✿:" }}
{%- endif %}
{{- message.content }}
{%- if loop.nextitem is undefined or loop.nextitem.role == "user" %}
{{- "<|im_end|>" }}
{%- endif %}
{%- else %}
{{- "\n<|im_start|>" + message.role + "\n" + message.content + "<|im_end|>" }}
{%- endif %}
{%- endfor %}
```
To use this template in `transformers`:
- Switches can be enabled by passing them to the `apply_chat_template` method, e.g., `tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True, parallel_tool_calls=True, language="zh", tokenize=False)`. By default, it is for English non-parallel function calling.
- The tool arguments should be a Python `dict` instead of a JSON-formatted object `str`.
- Since the generation needs to be stopped at `✿RESULT✿` or else the model will generate fabricated tool results, we should add it to `stop_strings` in `generation_config`:
```python
model.generation_config.stop_strings = ["✿RESULT✿:", "✿RETURN✿:"]
```
- As a result of using `stop_strings`, you need to pass the tokenizer to `model.generate` as `model.generate(**inputs, tokenizer=tokenizer, max_new_tokens=512)`.
- `response`, i.e., the model generation based on the tool calls and tool results, may contain a leading space. You should not strip it for the model. It results from the tokenization and the template design.
- The `try_parse_tool_calls` function should also be modified accordingly, as sketched below.
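A rough sketch of such a parser is shown below; it assumes parallel calls are rendered as repeated `✿FUNCTION✿`/`✿ARGS✿` pairs and that the arguments are valid JSON:

```python
import json
import re

def try_parse_qwen2_function_calls(content: str) -> dict:
    """Parse ✿FUNCTION✿/✿ARGS✿ pairs from the generated text into an assistant message."""
    tool_calls = []
    for name, args in re.findall(r"✿FUNCTION✿: (.+)\n✿ARGS✿: (.+)", content):
        try:
            arguments = json.loads(args)
        except json.JSONDecodeError:
            print(f"Failed to parse arguments: {args}")
            continue
        tool_calls.append({"type": "function", "function": {"name": name.strip(), "arguments": arguments}})
    if tool_calls:
        return {"role": "assistant", "content": "", "tool_calls": tool_calls}
    # strip the end-of-message token but keep any leading space (see the note above)
    return {"role": "assistant", "content": re.sub(r"<\|im_end\|>$", "", content)}
```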
:::
### Qwen2.5 Function Calling Templates
For `transformers` and Ollama, we have also used templates that are easier to implement with Jinja or Go.
They are variants of [the Nous Research's Hermes function calling template](https://github.com/NousResearch/Hermes-Function-Calling#prompt-format-for-function-calling).
The Jinja template and the Go template should produce basically the same results.
The final text should look like the following:
```text
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
Current Date: 2024-09-30
# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "get_current_temperature", "description": "Get current temperature at a location.", "parameters": {"type": "object", "properties": {"location": {"type": "string", "description": "The location to get the temperature for, in the format \"City, State, Country\"."}, "unit": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "The unit to return the temperature in. Defaults to \"celsius\"."}}, "required": ["location"]}}}
{"type": "function", "function": {"name": "get_temperature_date", "description": "Get temperature at a location and date.", "parameters": {"type": "object", "properties": {"location": {"type": "string", "description": "The location to get the temperature for, in the format \"City, State, Country\"."}, "date": {"type": "string", "description": "The date to get the temperature for, in the format \"Year-Month-Day\"."}, "unit": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "The unit to return the temperature in. Defaults to \"celsius\"."}}, "required": ["location", "date"]}}}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call><|im_end|>
<|im_start|>user
What's the temperature in San Francisco now? How about tomorrow?<|im_end|>
<|im_start|>assistant
<tool_call>
{"name": "get_current_temperature", "arguments": {"location": "San Francisco, CA, USA"}}
</tool_call>
<tool_call>
{"name": "get_temperature_date", "arguments": {"location": "San Francisco, CA, USA", "date": "2024-10-01"}}
</tool_call><|im_end|>
<|im_start|>user
<tool_response>
{"temperature": 26.1, "location": "San Francisco, CA, USA", "unit": "celsius"}
</tool_response>
<tool_response>
{"temperature": 25.9, "location": "San Francisco, CA, USA", "date": "2024-10-01", "unit": "celsius"}
</tool_response><|im_end|>
<|im_start|>assistant
The current temperature in San Francisco is approximately 26.1°C. Tomorrow, on October 1, 2024, the temperature is expected to be around 25.9°C.<|im_end|>
```
While the text may seem different from the previous one, the basic prompting structure is still the same.
There are just more structural tags and more JSON-formatted strings.
---
There is one thing we haven't talked about: how functions should be described to the LLMs.
In short, you could describe them as you would in API documentation, as long as you can effectively parse, validate, and execute the tool calls generated by the models.
Describing them with JSON Schema appears to be a valid and common choice.
## Finally
In whichever way you choose to use function calling with Qwen2.5, keep in mind that the limitations and the perks of prompt engineering apply:
- It is not guaranteed that the model generation will always follow the protocol, even with proper prompting or templates.
This is especially true for templates that are more complex and rely more on the model itself to think and stay on track, compared to simpler ones that rely on the template and the use of control or special tokens.
The latter, of course, requires some kind of training.
In production code, be prepared for breakage and have countermeasures or rectifications in place.
- If the generation is not up to expectation in certain scenarios, you can refine the template to add more instructions or constraints.
While the templates mentioned here are general enough, they may not be the best or the most specific or the most concise for your use cases.
The ultimate solution is fine-tuning using your own data.
Have fun prompting!
Qwen-Agent
==========
.. attention::
To be updated for Qwen3.
`Qwen-Agent <https://github.com/QwenLM/Qwen-Agent>`__ is a framework for
developing LLM applications based on the instruction following, tool
usage, planning, and memory capabilities of Qwen. It also comes with
example applications such as Browser Assistant, Code Interpreter, and
Custom Assistant.
Installation
------------
.. code:: bash
git clone https://github.com/QwenLM/Qwen-Agent.git
cd Qwen-Agent
pip install -e ./
Developing Your Own Agent
-------------------------
Qwen-Agent provides atomic components such as LLMs and prompts, as well
as high-level components such as Agents. The example below uses the
Assistant component as an illustration, demonstrating how to add custom
tools and quickly develop an agent that uses tools.
.. code:: py
import json
import os
import json5
import urllib.parse
from qwen_agent.agents import Assistant
from qwen_agent.tools.base import BaseTool, register_tool
llm_cfg = {
# Use the model service provided by DashScope:
'model': 'qwen-max',
'model_server': 'dashscope',
# 'api_key': 'YOUR_DASHSCOPE_API_KEY',
# It will use the `DASHSCOPE_API_KEY' environment variable if 'api_key' is not set here.
# Use your own model service compatible with OpenAI API:
# 'model': 'Qwen/Qwen2.5-7B-Instruct',
# 'model_server': 'http://localhost:8000/v1', # api_base
# 'api_key': 'EMPTY',
# (Optional) LLM hyperparameters for generation:
'generate_cfg': {
'top_p': 0.8
}
}
system = 'According to the user\'s request, you first draw a picture and then automatically run code to download the picture ' + \
'and select an image operation from the given document to process the image'
# Add a custom tool named my_image_gen:
@register_tool('my_image_gen')
class MyImageGen(BaseTool):
description = 'AI painting (image generation) service, input text description, and return the image URL drawn based on text information.'
parameters = [{
'name': 'prompt',
'type': 'string',
'description': 'Detailed description of the desired image content, in English',
'required': True
}]
def call(self, params: str, **kwargs) -> str:
prompt = json5.loads(params)['prompt']
prompt = urllib.parse.quote(prompt)
return json.dumps(
{'image_url': f'https://image.pollinations.ai/prompt/{prompt}'},
ensure_ascii=False)
tools = ['my_image_gen', 'code_interpreter'] # code_interpreter is a built-in tool in Qwen-Agent
bot = Assistant(llm=llm_cfg,
system_message=system,
function_list=tools,
files=[os.path.abspath('doc.pdf')])
messages = []
while True:
query = input('user question: ')
messages.append({'role': 'user', 'content': query})
response = []
for response in bot.run(messages=messages):
print('bot response:', response)
messages.extend(response)
The framework also provides more atomic components for developers to
combine. For additional showcases, please refer to
`examples <https://github.com/QwenLM/Qwen-Agent/tree/main/examples>`__.
# Key Concepts
:::{attention}
To be updated for Qwen3.
:::
## Qwen
Qwen (Chinese: 通义千问; pinyin: _Tongyi Qianwen_) is the large language model and large multimodal model series of the Qwen Team, Alibaba Group.
Qwen is capable of natural language understanding, text generation, vision understanding, audio understanding, tool use, role play, acting as an AI agent, etc.
Both language models and multimodal models are pre-trained on large-scale multilingual and multimodal data and post-trained on quality data for aligning to human preferences.
There are proprietary versions and open-weight versions.
The proprietary versions include
- Qwen: the language models
- Qwen Max
- Qwen Plus
- Qwen Turbo
- Qwen-VL: the vision-language models
- Qwen-VL Max
- Qwen-VL Plus
- Qwen-VL OCR
- Qwen-Audio: the audio-language models
- Qwen-Audio Turbo
- Qwen-Audio ASR
You can learn more about them at Alibaba Cloud Model Studio ([China Site](https://help.aliyun.com/zh/model-studio/getting-started/models#9f8890ce29g5u) \[zh\], [International Site](https://www.alibabacloud.com/en/product/modelstudio)).
The spectrum for the open-weight models spans over
- Qwen: the language models
- [Qwen](https://github.com/QwenLM/Qwen): 1.8B, 7B, 14B, and 72B models
  - [Qwen1.5](https://github.com/QwenLM/Qwen1.5/tree/v1.5): 0.5B, 1.8B, 4B, 14B-A2.7B, 7B, 14B, 32B, 72B, and 110B models
  - [Qwen2](https://github.com/QwenLM/Qwen2/tree/v2.0): 0.5B, 1.5B, 7B, 57B-A14B, and 72B models
- [Qwen2.5](https://github.com/QwenLM/Qwen2.5/): 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B models
- Qwen-VL: the vision-language models
- [Qwen-VL](https://github.com/QwenLM/Qwen-VL): 7B-based models
- [Qwen2-VL](https://github.com/QwenLM/Qwen2-VL): 2B, 7B, and 72B-based models
- Qwen-Audio: the audio-language models
- [Qwen-Audio](https://github.com/QwenLM/Qwen-Audio): 7B-based model
- [Qwen2-Audio](https://github.com/QwenLM/Qwen2-Audio): 7B-based models
- Q*Q: the reasoning models
- [QwQ-Preview](https://github.com/QwenLM/Qwen2.5/): 32B LLM
- [QVQ-Preview](https://github.com/QwenLM/Qwen2-VL): 72B VLM
- CodeQwen/Qwen-Coder: the language models for coding
- [CodeQwen1.5](https://github.com/QwenLM/CodeQwen1.5): 7B models
- [Qwen2.5-Coder](https://github.com/QwenLM/Qwen2.5-Coder): 0.5B, 1.5B, 3B, 7B, 14B, and 32B models
- Qwen-Math: the language models for mathematics
- [Qwen2-Math](https://github.com/QwenLM/Qwen2-Math): 1.5B, 7B, and 72B models
- [Qwen2.5-Math](https://github.com/QwenLM/Qwen2.5-Math): 1.5B, 7B, and 72B models
- Qwen-Math-RM: the reward models for mathematics
- [Qwen2-Math-RM](https://github.com/QwenLM/Qwen2-Math): 72B models
- [Qwen2.5-Math-RM](https://github.com/QwenLM/Qwen2.5-Math): 72B models
- [Qwen2.5-Math-PRM](https://github.com/QwenLM/Qwen2.5-Math): 7B and 72B models
**In this document, our focus is Qwen, the language models.**
## Causal Language Models
Causal language models, also known as autoregressive language models or decoder-only language models, are a type of machine learning model designed to predict the next token in a sequence based on the preceding tokens.
In other words, they generate text one token at a time, using the previously generated tokens as context.
The "causal" aspect refers to the fact that the model only considers the past context (the already generated tokens) when predicting the next token, not any future tokens.
Causal language models are widely used for various natural language processing tasks involving text completion and generation.
They have been particularly successful in generating coherent and contextually relevant text, making them a cornerstone of modern natural language understanding and generation systems.
**Takeaway: Qwen models are causal language models suitable for text completion.**
:::{dropdown} Learn more about language models
There are three main kinds of models that are commonly referred to as language models in deep learning:
- Sequence-to-sequence models: T5 and the likes
Sequence-to-sequence models use both an encoder to capture the entire input sequence and a decoder to generate an output sequence.
They are widely used for tasks like machine translation, text summarization, etc.
- Bidirectional models or encoder-only models: BERT and the likes
Bidirectional models can access both past and future context in a sequence during training.
They cannot generate sequential outputs in real-time due to the need for future context.
They are widely used as embedding models and subsequently used for text classification.
- Causal language models or decoder-only models: GPT and the likes
Causal language models operate unidirectionally in a strictly forward direction, predicting each subsequent word based only on the previous words in the sequence.
This unidirectional nature ensures that the model's predictions do not rely on future context, making them suitable for tasks like text completion and generation.
:::
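To make the "one token at a time" idea concrete, the following is a minimal sketch of greedy autoregressive decoding with the Hugging Face `transformers` library; the model name is only an illustrative choice, and in practice you would call `model.generate()` instead of writing the loop by hand.

```python
# A minimal sketch of causal (autoregressive) decoding: at each step the model
# scores the next token given only the tokens generated so far.
# The model name is illustrative; any causal language model works the same way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(8):                              # generate 8 tokens, one at a time
        logits = model(input_ids).logits            # scores for every position
        next_id = logits[:, -1, :].argmax(dim=-1)   # greedy pick at the last position
        input_ids = torch.cat([input_ids, next_id[:, None]], dim=-1)

print(tokenizer.decode(input_ids[0]))
```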
### Pre-training & Base models
Base language models are foundational models trained on extensive corpora of text to predict the next word in a sequence.
Their main goal is to capture the statistical patterns and structures of language, enabling them to generate coherent and contextually relevant text.
These models are versatile and can be adapted to various natural language processing tasks through fine-tuning.
While adept at producing fluent text, they may require in-context learning or additional training to follow specific instructions or perform complex reasoning tasks effectively.
For Qwen models, the base models are those without "-Instruct" indicators, such as Qwen2.5-7B and Qwen2.5-72B.
**Takeaway: Use base models for in-context learning, downstream fine-tuning, etc.**
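As a concrete illustration of in-context learning with a base model, here is a small sketch that completes a few-shot prompt as plain text, without any chat template; the model name and the prompt are only examples.

```python
# Few-shot in-context learning with a base (non-"-Instruct") model:
# the prompt is continued as plain text, and no chat template is applied.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # illustrative base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "cheese => fromage\n"
    "peppermint =>"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=8, do_sample=False)
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```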
### Post-training & Instruction-tuned models
Instruction-tuned language models are specialized models designed to understand and execute specific instructions in conversational styles.
These models are fine-tuned to interpret user commands accurately and can perform tasks such as summarization, translation, and question answering with improved accuracy and consistency.
Unlike base models, which are trained on large corpora of text, instruction-tuned models undergo additional training using datasets that contain examples of instructions and their desired outcomes, often in multiple turns.
This kind of training makes them ideal for applications requiring targeted functionalities while maintaining the ability to generate fluent and coherent text.
For Qwen models, the instruction-tuned models are those with the "-Instruct" suffix, such as Qwen2.5-7B-Instruct and Qwen2.5-72B-Instruct. [^instruct-chat]
**Takeaway: Use instruction-tuned models for conducting tasks in conversations, downstream fine-tuning, etc.**
[^instruct-chat]: Previously, they were known as chat models and used the "-Chat" suffix. Starting from Qwen2, the naming was changed to follow common practice. For Qwen, "-Instruct" and "-Chat" should be regarded as synonymous.
## Tokens & Tokenization
Tokens represent the fundamental units that models process and generate.
They can represent texts in human languages (regular tokens) or represent specific functionality like keywords in programming languages (control tokens [^special]).
Typically, a tokenizer is used to split text into regular tokens, which can be words, subwords, or characters depending on the specific tokenization scheme employed, and furnish the token sequence with control tokens as needed.
The vocabulary size, or the total number of unique tokens a model recognizes, significantly impacts its performance and versatility.
Larger language models often use sophisticated tokenization methods to handle the vast diversity of human language while keeping the vocabulary size manageable.
Qwen uses a relatively large vocabulary of 151,646 tokens in total.
[^special]: Control tokens can also be called special tokens. However, the meaning of "special tokens" needs to be interpreted based on the context: the special tokens of a model may include regular tokens in addition to control tokens.
**Takeaway: Tokenization method and vocabulary size are important.**
### Byte-level Byte Pair Encoding
Qwen adopts a subword tokenization method called Byte Pair Encoding (BPE), which attempts to learn the composition of tokens that can represent the text with the fewest tokens.
For example, the string " tokenization" is decomposed as " token" and "ization" (note that the space is part of the token).
In particular, the tokenization of Qwen ensures that there are no unknown words and all texts can be transformed into token sequences.
There are 151,643 tokens as a result of BPE in the vocabulary of Qwen, which is a large vocabulary efficient for diverse languages.
As a rule of thumb, 1 token is 3~4 characters for English texts and 1.5~1.8 characters for Chinese texts.
**Takeaway: Qwen processes texts in subwords and there are no unknown words.**
:::{dropdown} Learn more about tokenization in Qwen
Qwen uses byte-level BPE (BBPE) on UTF-8 encoded texts.
It starts by treating each byte as a token and then iteratively merges the most frequent pairs of tokens occurring in the texts into larger tokens until the desired vocabulary size is reached.
In byte-level BPE, a minimum of 256 tokens is needed to tokenize every piece of text and avoid the out-of-vocabulary (OOV) problem.
In comparison, character-level BPE needs every Unicode character in its vocabulary to avoid OOV and the Unicode Standard contains 154,998 characters as of Unicode Version 16.0.
One limitation to keep in mind for byte-level BPE is that individual tokens in the vocabulary may not appear semantically meaningful or even be valid UTF-8 byte sequences; in certain aspects, they should be viewed as a text compression scheme.
:::
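To check the rule of thumb above on your own data, you can tokenize a sample text and compare the character and token counts; a small sketch, assuming the `transformers` library (the tokenizer name is only an example and exact counts may vary by model version).

```python
# Inspect Qwen's byte-level BPE tokenization and the characters-per-token ratio.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

text = "Qwen tokenization keeps the vocabulary manageable across languages."
ids = tokenizer.encode(text)
pieces = tokenizer.tokenize(text)          # subword pieces, shown in byte-level form
print(pieces)
print(len(ids), "tokens for", len(text), "characters")
print("characters per token:", len(text) / len(ids))   # roughly 3~4 for English
```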
### Control Tokens & Chat Template
Control tokens and chat templates both serve as mechanisms to guide the model's behavior and outputs.
Control tokens are special tokens inserted into the sequence that signify meta information.
For example, in pre-training, multiple documents may be packed into a single sequence.
For Qwen, the control token "<|endoftext|>" is inserted after each document to signify that the document has ended and a new document will begin.
Chat templates provide a structured format for conversational interactions, where predefined placeholders or prompts are used to elicit responses from the model that adhere to a desired dialogue flow or context.
Different models may use different kinds of chat templates to format the conversations.
It is crucial to use the designated one to ensure precise control over the LLM's generation process.
Qwen uses the following format (ChatML[^chatml]), making use of control tokens to format each turn in the conversations:
```text
<|im_start|>{{role}}
{{content}}<|im_end|>
```
The user input takes the role of `user` and the model generation takes the role of `assistant`.
Qwen also supports the meta message that instructs the model to perform specific actions or generate text with certain characteristics, such as altering tone, style, or content; it takes the role of `system` and its content defaults to "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."
The following is a full example:
```text
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
hello<|im_end|>
<|im_start|>assistant
Hello! How can I assist you today? Let me know if you have any questions or need help with something. I'm here to help!<|im_end|>
<|im_start|>user
Explain large language models like I'm 5.<|im_end|>
<|im_start|>assistant
Sure, I'd be happy to explain large language models in a simple way!
Imagine that you have a friend who is really good at playing with words. This friend has read so many books and talked to so many people that they know how to use words in all sorts of different ways. They can tell stories, answer questions, and even make up new words if they need to.
Now imagine that this friend is actually a computer program, called a "large language model". It's been trained on lots and lots of text, like books, articles, and conversations, so it knows how to use words just like your word-loving friend does. When you ask the model a question or give it a task, it uses all the knowledge it's learned to come up with a response that makes sense.
Just like your friend might sometimes make mistakes or say things in a funny way, the large language model isn't perfect either. But it's still really impressive, because it can understand and generate human language in a way that was once thought impossible for machines!<|im_end|><|endoftext|>
```
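In practice, you rarely write this ChatML text by hand; the tokenizer's chat template produces it from a list of messages. A minimal sketch, assuming the `transformers` library and a Qwen2.5 tokenizer:

```python
# The chat template renders messages into the ChatML format shown above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": "hello"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(text)
# <|im_start|>system
# You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
# <|im_start|>user
# hello<|im_end|>
# <|im_start|>assistant
```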
Starting from Qwen2.5, the Qwen model family, including the multimodal and specialized models, uses a unified vocabulary, which contains control tokens from all subfamilies.
There are 22 control tokens in the vocabulary of Qwen2.5, bringing the total vocabulary size to 151,665:
- 1 general: `<|endoftext|>`
- 2 for chat: `<|im_start|>` and `<|im_end|>`
- 2 for tool use: `<tool_call>` and `</tool_call>`
- 11 for vision
- 6 for coding
**Takeaway: Qwen uses ChatML with control tokens for chat template.**
[^chatml]: For historical reference only: ChatML was first described in the OpenAI Python SDK. The last available version is [this](https://github.com/openai/openai-python/blob/v0.28.1/chatml.md). Please also be aware that that document lists use cases intended for OpenAI models. For Qwen2.5 models, please only use it as described in our guide.
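Since the regular BPE tokens occupy IDs below 151,643, the control tokens sit at the top of the vocabulary. The following is a small check, assuming the `transformers` library and a Qwen2.5 tokenizer; the exact IDs are model-specific.

```python
# Control tokens are appended after the 151,643 regular BPE tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

for token in ["<|endoftext|>", "<|im_start|>", "<|im_end|>", "<tool_call>", "</tool_call>"]:
    print(token, "->", tokenizer.convert_tokens_to_ids(token))

print("vocabulary size (regular + control):", len(tokenizer))  # 151,665 per the list above
```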
## Length Limit
As Qwen models are causal language models, in theory there is only one length limit, that of the entire sequence.
However, packing is often used in training, and each sequence may contain multiple individual pieces of text.
**How long the model can generate or complete ultimately depends on the use case and, in that regard, on how long each document (for pre-training) or each turn (for post-training) was in training.**
For Qwen2.5, the packed sequence length in training is 32,768 tokens.[^yarn]
The maximum document length in pre-training is this length.
In post-training, the maximum message lengths for the user and the assistant differ; in general, the assistant message can be up to 8192 tokens.
[^yarn]: The sequence length can be extended to 131,072 tokens for Qwen2.5-7B, Qwen2.5-14B, Qwen2.5-32B, and Qwen2.5-72B models with YaRN.
Please refer to the model card on how to enable YaRN in vLLM.
**Takeaway: Qwen2.5 models can process texts of 32K or 128K tokens and up to 8K tokens can be assistant output.**
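If you need to check whether an input fits the limit before sending it, a simple approach is to count its tokens with the tokenizer; a sketch under the assumption of a Qwen2.5 tokenizer and a hypothetical input file `doc.txt`.

```python
# Count input tokens against the 32,768-token training sequence length.
from transformers import AutoTokenizer

MAX_SEQ_LEN = 32768     # packed sequence length used in Qwen2.5 training
MAX_NEW_TOKENS = 8192   # rough budget for the assistant reply

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

with open("doc.txt") as f:          # hypothetical input document
    document = f.read()

n_tokens = len(tokenizer.encode(document))
if n_tokens + MAX_NEW_TOKENS > MAX_SEQ_LEN:
    print(f"{n_tokens} input tokens: consider truncating or enabling YaRN for long contexts.")
else:
    print(f"{n_tokens} input tokens: fits within the default limit.")
```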
Performance of Quantized Models
==================================
.. attention::
To be updated for Qwen3.
This section reports the generation performance of quantized
models (including GPTQ and AWQ) of the Qwen2 series. Specifically, we
report:
* MMLU (Accuracy)
* C-Eval (Accuracy)
* IFEval (Strict Prompt-Level Accuracy)
We use greedy decoding in evaluating all models.
+---------------------+--------------+---------+-------+--------+--------+
| | Quantization | Average | MMLU | C-Eval | IFEval |
+=====================+==============+=========+=======+========+========+
| Qwen2-72B-Instruct | BF16 | 81.3 | 82.3 | 83.8 | 77.6 |
+ +--------------+---------+-------+--------+--------+
| | GPTQ-Int8 | 80.7 | 81.3 | 83.4 | 77.5 |
+ +--------------+---------+-------+--------+--------+
| | GPTQ-Int4 | 81.2 | 80.8 | 83.9 | 78.9 |
+ +--------------+---------+-------+--------+--------+
| | AWQ | 80.4 | 80.5 | 83.9 | 76.9 |
+---------------------+--------------+---------+-------+--------+--------+
| Qwen2-7B-Instruct | BF16 | 66.9 | 70.5 | 77.2 | 53.1 |
+ +--------------+---------+-------+--------+--------+
| | GPTQ-Int8 | 66.2 | 69.1 | 76.7 | 52.9 |
+ +--------------+---------+-------+--------+--------+
| | GPTQ-Int4 | 64.1 | 67.8 | 75.2 | 49.4 |
+ +--------------+---------+-------+--------+--------+
| | AWQ | 64.1 | 67.4 | 73.6 | 51.4 |
+---------------------+--------------+---------+-------+--------+--------+
| Qwen2-1.5B-Instruct | BF16 | 48.4 | 52.4 | 63.8 | 29.0 |
+ +--------------+---------+-------+--------+--------+
| | GPTQ-Int8 | 48.1 | 53.0 | 62.5 | 28.8 |
+ +--------------+---------+-------+--------+--------+
| | GPTQ-Int4 | 45.0 | 50.7 | 57.4 | 27.0 |
+ +--------------+---------+-------+--------+--------+
| | AWQ | 46.5 | 51.6 | 58.1 | 29.9 |
+---------------------+--------------+---------+-------+--------+--------+
| Qwen2-0.5B-Instruct | BF16 | 34.4 | 37.9 | 45.2 | 20.0 |
+ +--------------+---------+-------+--------+--------+
| | GPTQ-Int8 | 32.6 | 35.6 | 43.9 | 18.1 |
+ +--------------+---------+-------+--------+--------+
| | GPTQ-Int4 | 29.7 | 33.0 | 39.2 | 16.8 |
+ +--------------+---------+-------+--------+--------+
| | AWQ | 31.1 | 34.4 | 42.1 | 16.7 |
+---------------------+--------------+---------+-------+--------+--------+
# Quickstart
This guide helps you quickly start using Qwen3.
We provide examples with [Hugging Face Transformers](https://github.com/huggingface/transformers) and [ModelScope](https://github.com/modelscope/modelscope), as well as [vLLM](https://github.com/vllm-project/vllm) for deployment.
You can find Qwen3 models in [the Qwen3 collection](https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f) at HuggingFace Hub and [the Qwen3 collection](https://www.modelscope.cn/collections/Qwen3-9743180bdc6b48) at ModelScope.
## Transformers
To get a quick start with Qwen3, you can try the inference with `transformers` first.
Make sure that you have installed `transformers>=4.51.0`.
We advise you to use Python 3.10 or higher, and PyTorch 2.6 or higher.
The following is a very simple code snippet showing how to run Qwen3-8B:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Qwen/Qwen3-8B"
# load the tokenizer and the model
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
{"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=True, # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
# conduct text completion
generated_ids = model.generate(
**model_inputs,
max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
# parse thinking content
try:
# rindex finding 151668 (</think>)
index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
index = 0
thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
print("thinking content:", thinking_content)
print("content:", content)
```
Qwen3 will think before responding, similar to QwQ models.
This means the model will use its reasoning abilities to enhance the quality of generated responses.
The model will first generate thinking content wrapped in a `<think>...</think>` block, followed by the final response.
- Hard Switch:
To strictly disable the model's thinking behavior, aligning its functionality with the previous Qwen2.5-Instruct models, you can set `enable_thinking=False` when formatting the text.
```python
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=False, # Setting enable_thinking=False disables thinking mode
)
```
It can be particularly useful in scenarios where disabling thinking is essential for enhancing efficiency.
- Soft Switch:
Qwen3 also understands the user's instructions on its thinking behaviour, in particular the soft switches `/think` and `/no_think`.
You can add them to user prompts or system messages to switch the model's thinking mode from turn to turn, as in the sketch after this list.
The model will follow the most recent instruction in multi-turn conversations.
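The following is a minimal sketch of the soft switch, reusing `apply_chat_template` as in the Transformers snippet above; the question text is only an example.

```python
# Soft switch: append /no_think (or /think) to a user turn; keep the hard
# switch (enable_thinking) on so the model can follow the in-text instruction.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

messages = [
    {"role": "user", "content": "Give me a short introduction to large language models. /no_think"},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # the model follows the most recent /think or /no_think instead
)
```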
:::{note}
For thinking mode, use Temperature=0.6, TopP=0.95, TopK=20, and MinP=0 (the default setting in `generation_config.json`).
DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions.
For more detailed guidance, please refer to the Best Practices section.
For non-thinking mode, we suggest using Temperature=0.7, TopP=0.8, TopK=20, and MinP=0.
:::
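For reference, these settings map directly onto the sampling arguments of `model.generate`; a sketch that assumes the `model` and `model_inputs` from the snippet above (whether `min_p` is accepted depends on your `transformers` version).

```python
# Thinking-mode sampling settings passed explicitly to generate().
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768,
    do_sample=True,      # sampling instead of greedy decoding
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    min_p=0.0,           # assumed available in recent transformers releases
)
```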
## ModelScope
To tackle downloading issues, we advise you to try [ModelScope](https://github.com/modelscope/modelscope).
Before starting, you need to install `modelscope` with `pip`.
`modelscope` adopts a programmatic interface similar (but not identical) to `transformers`.
For basic usage, you can simply change the first line of code above to the following:
```python
from modelscope import AutoModelForCausalLM, AutoTokenizer
```
For more information, please refer to [the documentation of `modelscope`](https://www.modelscope.cn/docs).
## vLLM
To deploy Qwen3, we advise you to use vLLM.
vLLM is a fast and easy-to-use framework for LLM inference and serving.
In the following, we demonstrate how to build an OpenAI-compatible API service with vLLM.
First, make sure you have installed `vllm>=0.8.5`.
Run the following command to start a vLLM service.
Here we take Qwen3-8B as an example:
```bash
vllm serve Qwen/Qwen3-8B --enable-reasoning --reasoning-parser deepseek_r1
```
Then, you can use the [create chat interface](https://platform.openai.com/docs/api-reference/chat/completions/create) to communicate with Qwen:
::::{tab-set}
:::{tab-item} curl
```shell
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen3-8B",
"messages": [
{"role": "user", "content": "Give me a short introduction to large language models."}
],
"temperature": 0.6,
"top_p": 0.95,
"top_k": 20,
"max_tokens": 32768
}'
```
:::
:::{tab-item} Python
You can use the API client with the `openai` Python SDK as shown below:
```python
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
chat_response = client.chat.completions.create(
model="Qwen/Qwen3-8B",
messages=[
{"role": "user", "content": "Give me a short introduction to large language models."},
],
temperature=0.6,
top_p=0.95,
top_k=20,
max_tokens=32768,
)
print("Chat response:", chat_response)
```
::::
While the soft switch is always available, the hard switch is also available in vLLM through the following configuration in the API call.
To disable thinking, use
::::{tab-set}
:::{tab-item} curl
```shell
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen3-8B",
"messages": [
{"role": "user", "content": "Give me a short introduction to large language models."}
],
"temperature": 0.7,
"top_p": 0.8,
"top_k": 20,
"max_tokens": 8192,
"presence_penalty": 1.5,
"chat_template_kwargs": {"enable_thinking": false}
}'
```
:::
:::{tab-item} Python
You can use the API client with the `openai` Python SDK as shown below:
```python
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
chat_response = client.chat.completions.create(
model="Qwen/Qwen3-8B",
messages=[
{"role": "user", "content": "Give me a short introduction to large language models."},
],
temperature=0.7,
top_p=0.8,
top_k=20,
presence_penalty=1.5,
extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print("Chat response:", chat_response)
```
::::
## Next Step
Now, you can have fun with the Qwen3 models.
Would you like to learn more about their usage?
Feel free to check the other documents in this documentation.
Speed Benchmark
=========================
.. attention::
To be updated for Qwen3.
This section reports the speed performance of the bf16 models and the quantized models
(including GPTQ-Int4, GPTQ-Int8 and AWQ) of the Qwen2.5 series. Specifically, we
report the inference speed (tokens/s) as well as the memory footprint (GB)
under different context lengths.
The environment of the evaluation with huggingface transformers is:
- NVIDIA A100 80GB
- CUDA 12.1
- Pytorch 2.3.1
- Flash Attention 2.5.8
- Transformers 4.46.0
- AutoGPTQ 0.7.1+cu121 (Compiled from source code)
- AutoAWQ 0.2.6
The environment of the evaluation with vLLM is:
- NVIDIA A100 80GB
- CUDA 12.1
- vLLM 0.6.3
- Pytorch 2.4.0
- Flash Attention 2.6.3
- Transformers 4.46.0
Notes:
- We use a batch size of 1 and the fewest GPUs possible for the evaluation.
- We test the speed and memory of generating 2048 tokens with
the input lengths of 1, 6144, 14336, 30720, 63488, and 129024
tokens.
- For vLLM, the memory usage is not reported because it pre-allocates
all GPU memory. We use ``gpu_memory_utilization=0.9 max_model_len=32768 enforce_eager=False``
by default.
- 0.5B (Transformer)
+-------------------------+--------------+--------------+---------+-----------------+----------------+---------------------------+
| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) | Note |
+=========================+==============+==============+=========+=================+================+===========================+
| Qwen2.5-0.5B-Instruct | 1 | BF16 | 1 | 47.40 | 0.97 | |
+ + +--------------+---------+-----------------+----------------+---------------------------+
| | | GPTQ-Int8 | 1 | 35.17 | 0.64 | auto_gptq==0.6.0+cu1210 |
+ + +--------------+---------+-----------------+----------------+---------------------------+
| | | GPTQ-Int4 | 1 | 50.60 | 0.48 | |
+ + +--------------+---------+-----------------+----------------+---------------------------+
| | | AWQ | 1 | 37.09 | 0.68 | |
+ +--------------+--------------+---------+-----------------+----------------+---------------------------+
| | 6144 | BF16 | 1 | 47.45 | 1.23 | |
+ + +--------------+---------+-----------------+----------------+---------------------------+
| | | GPTQ-Int8 | 1 | 36.47 | 0.90 | auto_gptq==0.6.0+cu1210 |
+ + +--------------+---------+-----------------+----------------+---------------------------+
| | | GPTQ-Int4 | 1 | 48.89 | 0.73 | |
+ + +--------------+---------+-----------------+----------------+---------------------------+
| | | AWQ | 1 | 37.04 | 0.72 | |
+ +--------------+--------------+---------+-----------------+----------------+---------------------------+
| | 14336 | BF16 | 1 | 47.11 | 1.60 | |
+ + +--------------+---------+-----------------+----------------+---------------------------+
| | | GPTQ-Int8 | 1 | 35.44 | 1.26 | auto_gptq==0.6.0+cu1210 |
+ + +--------------+---------+-----------------+----------------+---------------------------+
| | | GPTQ-Int4 | 1 | 48.26 | 1.10 | |
+ + +--------------+---------+-----------------+----------------+---------------------------+
| | | AWQ | 1 | 37.14 | 1.10 | |
+ +--------------+--------------+---------+-----------------+----------------+---------------------------+
| | 30720 | BF16 | 1 | 47.16 | 2.34 | |
+ + +--------------+---------+-----------------+----------------+---------------------------+
| | | GPTQ-Int8 | 1 | 36.25 | 2.01 | auto_gptq==0.6.0+cu1210 |
+ + +--------------+---------+-----------------+----------------+---------------------------+
| | | GPTQ-Int4 | 1 | 49.22 | 1.85 | |
+ + +--------------+---------+-----------------+----------------+---------------------------+
| | | AWQ | 1 | 36.90 | 1.84 | |
+-------------------------+--------------+--------------+---------+-----------------+----------------+---------------------------+
- 0.5B (vLLM)
+-------------------------+--------------+--------------+---------+-----------------+
| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) |
+=========================+==============+==============+=========+=================+
| Qwen2.5-0.5B-Instruct | 1 | BF16 | 1 | 311.55 |
+ + +--------------+---------+-----------------+
| | | GPTQ-Int8 | 1 | 257.07 |
+ + +--------------+---------+-----------------+
| | | GPTQ-Int4 | 1 | 260.93 |
+ + +--------------+---------+-----------------+
| | | AWQ | 1 | 261.95 |
+ +--------------+--------------+---------+-----------------+
| | 6144 | BF16 | 1 | 304.79 |
+ + +--------------+---------+-----------------+
| | | GPTQ-Int8 | 1 | 254.10 |
+ + +--------------+---------+-----------------+
| | | GPTQ-Int4 | 1 | 257.33 |
+ + +--------------+---------+-----------------+
| | | AWQ | 1 | 259.80 |
+ +--------------+--------------+---------+-----------------+
| | 14336 | BF16 | 1 | 290.28 |
+ + +--------------+---------+-----------------+
| | | GPTQ-Int8 | 1 | 243.69 |
+ + +--------------+---------+-----------------+
| | | GPTQ-Int4 | 1 | 247.01 |
+ + +--------------+---------+-----------------+
| | | AWQ | 1 | 249.58 |
+ +--------------+--------------+---------+-----------------+
| | 30720 | BF16 | 1 | 264.51 |
+ + +--------------+---------+-----------------+
| | | GPTQ-Int8 | 1 | 223.86 |
+ + +--------------+---------+-----------------+
| | | GPTQ-Int4 | 1 | 226.50 |
+ + +--------------+---------+-----------------+
| | | AWQ | 1 | 229.84 |
+-------------------------+--------------+--------------+---------+-----------------+
- 1.5B (Transformer)
+--------------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) | Note |
+==========================+==============+==============+=========+=================+================+=========================+
| Qwen2.5-1.5B-Instruct | 1 | BF16 | 1 | 39.68 | 2.95 | |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | GPTQ-Int8 | 1 | 32.62 | 1.82 | auto_gptq==0.6.0+cu1210 |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | GPTQ-Int4 | 1 | 43.33 | 1.18 | |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | AWQ | 1 | 31.70 | 1.51 | |
+ +--------------+--------------+---------+-----------------+----------------+-------------------------+
| | 6144 | BF16 | 1 | 40.88 | 3.43 | |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | GPTQ-Int8 | 1 | 31.46 | 2.30 | auto_gptq==0.6.0+cu1210 |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | GPTQ-Int4 | 1 | 43.96 | 1.66 | |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | AWQ | 1 | 32.30 | 1.63 | |
+ +--------------+--------------+---------+-----------------+----------------+-------------------------+
| | 14336 | BF16 | 1 | 40.43 | 4.16 | |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | GPTQ-Int8 | 1 | 31.06 | 3.03 | auto_gptq==0.6.0+cu1210 |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | GPTQ-Int4 | 1 | 43.66 | 2.39 | |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | AWQ | 1 | 32.39 | 2.36 | |
+ +--------------+--------------+---------+-----------------+----------------+-------------------------+
| | 30720 | BF16 | 1 | 38.59 | 5.62 | |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | GPTQ-Int8 | 1 | 31.04 | 4.49 | auto_gptq==0.6.0+cu1210 |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | GPTQ-Int4 | 1 | 35.68 | 3.85 | |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | AWQ | 1 | 31.95 | 3.82 | |
+--------------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
- 1.5B (vLLM)
+--------------------------+--------------+--------------+---------+-----------------+
| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) |
+==========================+==============+==============+=========+=================+
| Qwen2.5-1.5B-Instruct | 1 | BF16 | 1 | 183.33 |
+ + +--------------+---------+-----------------+
| | | GPTQ-Int8 | 1 | 201.67 |
+ + +--------------+---------+-----------------+
| | | GPTQ-Int4 | 1 | 217.03 |
+ + +--------------+---------+-----------------+
| | | AWQ | 1 | 213.74 |
+ +--------------+--------------+---------+-----------------+
| | 6144 | BF16 | 1 | 176.68 |
+ + +--------------+---------+-----------------+
| | | GPTQ-Int8 | 1 | 192.83 |
+ + +--------------+---------+-----------------+
| | | GPTQ-Int4 | 1 | 206.63 |
+ + +--------------+---------+-----------------+
| | | AWQ | 1 | 203.64 |
+ +--------------+--------------+---------+-----------------+
| | 14336 | BF16 | 1 | 168.69 |
+ + +--------------+---------+-----------------+
| | | GPTQ-Int8 | 1 | 183.69 |
+ + +--------------+---------+-----------------+
| | | GPTQ-Int4 | 1 | 195.88 |
+ + +--------------+---------+-----------------+
| | | AWQ | 1 | 192.64 |
+ +--------------+--------------+---------+-----------------+
| | 30720 | BF16 | 1 | 152.04 |
+ + +--------------+---------+-----------------+
| | | GPTQ-Int8 | 1 | 162.82 |
+ + +--------------+---------+-----------------+
| | | GPTQ-Int4 | 1 | 173.57 |
+ + +--------------+---------+-----------------+
| | | AWQ | 1 | 170.20 |
+--------------------------+--------------+--------------+---------+-----------------+
- 3B (Transformer)
+--------------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) | Note |
+==========================+==============+==============+=========+=================+================+=========================+
| Qwen2.5-3B-Instruct | 1 | BF16 | 1 | 30.80 | 5.95 | |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | GPTQ-Int8 | 1 | 25.69 | 3.38 | auto_gptq==0.6.0+cu1210 |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | GPTQ-Int4 | 1 | 35.21 | 2.06 | |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | AWQ | 1 | 25.29 | 2.50 | |
+ +--------------+--------------+---------+-----------------+----------------+-------------------------+
| | 6144 | BF16 | 1 | 32.20 | 6.59 | |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | GPTQ-Int8 | 1 | 24.69 | 3.98 | auto_gptq==0.6.0+cu1210 |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | GPTQ-Int4 | 1 | 34.47 | 2.67 | |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | AWQ | 1 | 24.86 | 2.62 | |
+ +--------------+--------------+---------+-----------------+----------------+-------------------------+
| | 14336 | BF16 | 1 | 31.72 | 7.47 | |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | GPTQ-Int8 | 1 | 24.70 | 4.89 | auto_gptq==0.6.0+cu1210 |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | GPTQ-Int4 | 1 | 34.36 | 3.58 | |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | AWQ | 1 | 25.19 | 3.54 | |
+ +--------------+--------------+---------+-----------------+----------------+-------------------------+
| | 30720 | BF16 | 1 | 25.37 | 9.30 | |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | GPTQ-Int8 | 1 | 21.67 | 6.72 | auto_gptq==0.6.0+cu1210 |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | GPTQ-Int4 | 1 | 23.60 | 5.41 | |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | AWQ | 1 | 24.56 | 5.37 | |
+--------------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
- 3B (vLLM)
+--------------------------+--------------+--------------+---------+-----------------+
| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) |
+==========================+==============+==============+=========+=================+
| Qwen2.5-3B-Instruct | 1 | BF16 | 1 | 127.61 |
+ + +--------------+---------+-----------------+
| | | GPTQ-Int8 | 1 | 150.02 |
+ + +--------------+---------+-----------------+
| | | GPTQ-Int4 | 1 | 168.20 |
+ + +--------------+---------+-----------------+
| | | AWQ | 1 | 165.50 |
+ +--------------+--------------+---------+-----------------+
| | 6144 | BF16 | 1 | 123.15 |
+ + +--------------+---------+-----------------+
| | | GPTQ-Int8 | 1 | 143.09 |
+ + +--------------+---------+-----------------+
| | | GPTQ-Int4 | 1 | 159.85 |
+ + +--------------+---------+-----------------+
| | | AWQ | 1 | 156.38 |
+ +--------------+--------------+---------+-----------------+
| | 14336 | BF16 | 1 | 117.35 |
+ + +--------------+---------+-----------------+
| | | GPTQ-Int8 | 1 | 135.50 |
+ + +--------------+---------+-----------------+
| | | GPTQ-Int4 | 1 | 149.35 |
+ + +--------------+---------+-----------------+
| | | AWQ | 1 | 147.75 |
+ +--------------+--------------+---------+-----------------+
| | 30720 | BF16 | 1 | 105.88 |
+ + +--------------+---------+-----------------+
| | | GPTQ-Int8 | 1 | 118.38 |
+ + +--------------+---------+-----------------+
| | | GPTQ-Int4 | 1 | 129.28 |
+ + +--------------+---------+-----------------+
| | | AWQ | 1 | 127.19 |
+--------------------------+--------------+--------------+---------+-----------------+
- 7B (Transformer)
+-----------------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) | Note |
+=============================+==============+==============+=========+=================+================+=========================+
| Qwen2.5-7B-Instruct | 1 | BF16 | 1 | 40.38 | 14.38 | |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | GPTQ-Int8 | 1 | 31.55 | 8.42 | auto_gptq==0.6.0+cu1210 |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | GPTQ-Int4 | 1 | 43.10 | 5.52 | |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | AWQ | 1 | 32.03 | 5.39 | |
+ +--------------+--------------+---------+-----------------+----------------+-------------------------+
| | 6144 | BF16 | 1 | 38.76 | 15.38 | |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | GPTQ-Int8 | 1 | 31.26 | 9.43 | auto_gptq==0.6.0+cu1210 |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | GPTQ-Int4 | 1 | 38.27 | 6.52 | |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | AWQ | 1 | 32.37 | 6.39 | |
+ +--------------+--------------+---------+-----------------+----------------+-------------------------+
| | 14336 | BF16 | 1 | 29.78 | 16.91 | |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | GPTQ-Int8 | 1 | 26.86 | 10.96 | auto_gptq==0.6.0+cu1210 |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | GPTQ-Int4 | 1 | 28.70 | 8.05 | |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | AWQ | 1 | 30.23 | 7.92 | |
+ +--------------+--------------+---------+-----------------+----------------+-------------------------+
| | 30720 | BF16 | 1 | 18.83 | 19.97 | |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | GPTQ-Int8 | 1 | 17.59 | 14.01 | auto_gptq==0.6.0+cu1210 |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | GPTQ-Int4 | 1 | 18.45 | 11.11 | |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | AWQ | 1 | 19.11 | 10.98 | |
+-----------------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
- 7B (vLLM)
+-----------------------------+--------------+--------------+---------+-----------------+-------------------------------------------+
| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | Note |
+=============================+==============+==============+=========+=================+===========================================+
| Qwen2.5-7B-Instruct | 1 | BF16 | 1 | 84.28 | |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int8 | 1 | 122.01 | |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int4 | 1 | 154.05 | |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | AWQ | 1 | 148.10 | |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | 6144 | BF16 | 1 | 80.70 | |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int8 | 1 | 112.38 | |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int4 | 1 | 141.98 | |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | AWQ | 1 | 137.64 | |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | 14336 | BF16 | 1 | 77.69 | |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int8 | 1 | 105.25 | |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int4 | 1 | 129.35 | |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | AWQ | 1 | 124.91 | |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | 30720 | BF16 | 1 | 70.33 | |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int8 | 1 | 90.71 | |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int4 | 1 | 108.30 | |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | AWQ | 1 | 104.66 | |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | 63488 | BF16 | 1 | 50.86 | setting-64k |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int8 | 1 | 60.52 | setting-64k |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int4 | 1 | 67.97 | setting-64k |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | AWQ | 1 | 66.42 | setting-64k |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | 129024 | BF16 | 1 | 28.94 | vllm==0.6.2, new sample config |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int8 | 1 | 25.97 | vllm==0.6.2, new sample config |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int4 | 1 | 26.37 | vllm==0.6.2, new sample config |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | AWQ | 1 | 26.57 | vllm==0.6.2, new sample config |
+-----------------------------+--------------+--------------+---------+-----------------+-------------------------------------------+
* [Setting-64k]=(gpu_memory_utilization=0.9 max_model_len=65536 enforce_eager=False)
* [new sample config]: for vLLM, set the following sampling parameters: SamplingParams(temperature=0.7,top_p=0.8,top_k=20,repetition_penalty=1,presence_penalty=0,frequency_penalty=0,max_tokens=out_length)
- 14B (Transformer)
+--------------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) | Note |
+==========================+==============+==============+=========+=================+================+=========================+
| Qwen2.5-14B-Instruct | 1 | BF16 | 1 | 24.74 | 28.08 | |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | GPTQ-Int8 | 1 | 18.84 | 16.11 | auto_gptq==0.6.0+cu1210 |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | GPTQ-Int4 | 1 | 25.89 | 9.94 | |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | AWQ | 1 | 19.23 | 9.79 | |
+ +--------------+--------------+---------+-----------------+----------------+-------------------------+
| | 6144 | BF16 | 1 | 20.51 | 29.50 | |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | GPTQ-Int8 | 1 | 17.80 | 17.61 | auto_gptq==0.6.0+cu1210 |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | GPTQ-Int4 | 1 | 20.06 | 11.36 | |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | AWQ | 1 | 19.21 | 11.22 | |
+ +--------------+--------------+---------+-----------------+----------------+-------------------------+
| | 14336 | BF16 | 1 | 13.92 | 31.95 | |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | GPTQ-Int8 | 1 | 12.66 | 19.98 | auto_gptq==0.6.0+cu1210 |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | GPTQ-Int4 | 1 | 13.79 | 13.81 | |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | AWQ | 1 | 14.17 | 13.67 | |
+ +--------------+--------------+---------+-----------------+----------------+-------------------------+
| | 30720 | BF16 | 1 | 8.20 | 36.85 | |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | GPTQ-Int8 | 1 | 7.77 | 24.88 | auto_gptq==0.6.0+cu1210 |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | GPTQ-Int4 | 1 | 8.14 | 18.71 | |
+ + +--------------+---------+-----------------+----------------+-------------------------+
| | | AWQ | 1 | 8.31 | 18.57 | |
+--------------------------+--------------+--------------+---------+-----------------+----------------+-------------------------+
- 14B (vLLM)
+-----------------------------+--------------+--------------+---------+-----------------+-------------------------------------------+
| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | Note |
+=============================+==============+==============+=========+=================+===========================================+
| Qwen2.5-14B-Instruct | 1 | BF16 | 1 | 46.30 | |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int8 | 1 | 70.40 | |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int4 | 1 | 98.02 | |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | AWQ | 1 | 92.66 | |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | 6144 | BF16 | 1 | 43.83 | |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int8 | 1 | 64.33 | |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int4 | 1 | 86.10 | |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | AWQ | 1 | 83.11 | |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | 14336 | BF16 | 1 | 41.91 | |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int8 | 1 | 59.21 | |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int4 | 1 | 76.85 | |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | AWQ | 1 | 74.03 | |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | 30720 | BF16 | 1 | 37.18 | |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int8 | 1 | 49.23 | |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int4 | 1 | 60.91 | |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | AWQ | 1 | 59.01 | |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | 63488 | BF16 | 1 | 26.85 | setting-64k |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int8 | 1 | 32.83 | setting-64k |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int4 | 1 | 37.67 | setting-64k |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | AWQ | 1 | 36.71 | setting-64k |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | 129024 | BF16 | 1 | 14.53 | vllm==0.6.2, new sample config |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int8 | 1 | 15.10 | vllm==0.6.2, new sample config |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int4 | 1 | 15.13 | vllm==0.6.2, new sample config |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | AWQ | 1 | 15.25 | vllm==0.6.2, new sample config |
+-----------------------------+--------------+--------------+---------+-----------------+-------------------------------------------+
* [Setting-64k]=(gpu_memory_utilization=0.9 max_model_len=65536 enforce_eager=False)
* [new sample config]: for vLLM, set the following sampling parameters: SamplingParams(temperature=0.7,top_p=0.8,top_k=20,repetition_penalty=1,presence_penalty=0,frequency_penalty=0,max_tokens=out_length)
- 32B (Transformer)
+-----------------------------+--------------+--------------+---------+-----------------+----------------+-------------------------------------------+
| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) | Note |
+=============================+==============+==============+=========+=================+================+===========================================+
| Qwen2.5-32B-Instruct | 1 | BF16 | 1 | 17.54 | 61.58 | |
+ + +--------------+---------+-----------------+----------------+-------------------------------------------+
| | | GPTQ-Int8 | 1 | 14.52 | 33.56 | auto_gptq==0.6.0+cu1210 |
+ + +--------------+---------+-----------------+----------------+-------------------------------------------+
| | | GPTQ-Int4 | 1 | 19.20 | 18.94 | |
+ + +--------------+---------+-----------------+----------------+-------------------------------------------+
| | | AWQ | 1 | 14.60 | 18.67 | |
+ +--------------+--------------+---------+-----------------+----------------+-------------------------------------------+
| | 6144 | BF16 | 1 | 12.49 | 63.72 | |
+ + +--------------+---------+-----------------+----------------+-------------------------------------------+
| | | GPTQ-Int8 | 1 | 11.61 | 35.86 | auto_gptq==0.6.0+cu1210 |
+ + +--------------+---------+-----------------+----------------+-------------------------------------------+
| | | GPTQ-Int4 | 1 | 13.42 | 21.09 | |
+ + +--------------+---------+-----------------+----------------+-------------------------------------------+
| | | AWQ | 1 | 13.81 | 20.81 | |
+ +--------------+--------------+---------+-----------------+----------------+-------------------------------------------+
| | 14336 | BF16 | 1 | 8.95 | 67.31 | |
+ + +--------------+---------+-----------------+----------------+-------------------------------------------+
| | | GPTQ-Int8 | 1 | 8.53 | 39.28 | auto_gptq==0.6.0+cu1210 |
+ + +--------------+---------+-----------------+----------------+-------------------------------------------+
| | | GPTQ-Int4 | 1 | 9.48 | 24.67 | |
+ + +--------------+---------+-----------------+----------------+-------------------------------------------+
| | | AWQ | 1 | 9.71 | 24.39 | |
+ +--------------+--------------+---------+-----------------+----------------+-------------------------------------------+
| | 30720 | BF16 | 1 | 5.59 | 74.47 | |
+ + +--------------+---------+-----------------+----------------+-------------------------------------------+
| | | GPTQ-Int8 | 1 | 5.42 | 46.45 | auto_gptq==0.6.0+cu1210 |
+ + +--------------+---------+-----------------+----------------+-------------------------------------------+
| | | GPTQ-Int4 | 1 | 5.79 | 31.84 | |
+ + +--------------+---------+-----------------+----------------+-------------------------------------------+
| | | AWQ | 1 | 5.85 | 31.56 | |
+-----------------------------+--------------+--------------+---------+-----------------+----------------+-------------------------------------------+
- 32B (vLLM)
+-----------------------------+--------------+--------------+---------+-----------------+-------------------------------------------+
| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | Note |
+=============================+==============+==============+=========+=================+===========================================+
| Qwen2.5-32B-Instruct | 1 | BF16 | 1 | 22.13 | setting1 |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int8 | 1 | 37.57 | |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int4 | 1 | 55.83 | |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | AWQ | 1 | 51.92 | |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | 6144 | BF16 | 1 | 21.05 | setting1 |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int8 | 1 | 34.67 | |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int4 | 1 | 49.96 | |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | AWQ | 1 | 46.68 | |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | 14336 | BF16 | 1 | 19.91 | setting1 |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int8 | 1 | 31.89 | |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int4 | 1 | 44.79 | |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | AWQ | 1 | 41.83 | |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | 30720 | BF16 | 2 | 31.82 | |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int8 | 1 | 26.88 | |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int4 | 1 | 35.66 | |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | AWQ | 1 | 33.75 | |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | 63488 | BF16 | 2 | 24.45 | setting-64k |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int8 | 1 | 18.60 | setting-64k |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int4 | 1 | 22.72 | setting-64k |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | AWQ | 1 | 21.79 | setting-64k |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | 129024 | BF16 | 2 | 14.31 | vllm==0.6.2, new sample config |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int8 | 1 | 9.77 | vllm==0.6.2, new sample config |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int4 | 1 | 10.39 | vllm==0.6.2, new sample config |
+ + +--------------+---------+-----------------+-------------------------------------------+
| | | AWQ | 1 | 10.34 | vllm==0.6.2, new sample config |
+-----------------------------+--------------+--------------+---------+-----------------+-------------------------------------------+
* For context length 129024, the model needs to be run with the following config: "model_max_length"=131072
* [Default Setting]=(gpu_memory_utilization=0.9 max_model_len=32768 enforce_eager=False)
* [Setting 1]=(gpu_memory_utilization=1.0 max_model_len=32768 enforce_eager=True)
* [Setting-64k]=(gpu_memory_utilization=0.9 max_model_len=65536 enforce_eager=False)
* [new sample config]: for vLLM, set the following sampling parameters: SamplingParams(temperature=0.7,top_p=0.8,top_k=20,repetition_penalty=1,presence_penalty=0,frequency_penalty=0,max_tokens=out_length)
- 72B (Transformer)
+-----------------------------+--------------+--------------+---------+-----------------+----------------+-------------------------------------------+
| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | GPU Memory(GB) | Note |
+=============================+==============+==============+=========+=================+================+===========================================+
| Qwen2.5-72B-Instruct | 1 | BF16 | 2 | 8.73 | 136.20 | |
+ + +--------------+---------+-----------------+----------------+-------------------------------------------+
| | | GPTQ-Int8 | 2 | 8.66 | 72.61 | auto_gptq==0.6.0+cu1210 |
+ + +--------------+---------+-----------------+----------------+-------------------------------------------+
| | | GPTQ-Int4 | 1 | 11.07 | 39.91 | |
+ + +--------------+---------+-----------------+----------------+-------------------------------------------+
| | | AWQ | 1 | 11.50 | 39.44 | |
+ +--------------+--------------+---------+-----------------+----------------+-------------------------------------------+
| | 6144 | BF16 | 2 | 6.39 | 140.00 | |
+ + +--------------+---------+-----------------+----------------+-------------------------------------------+
| | | GPTQ-Int8 | 2 | 6.39 | 77.81 | auto_gptq==0.6.0+cu1210 |
+ + +--------------+---------+-----------------+----------------+-------------------------------------------+
| | | GPTQ-Int4 | 1 | 7.56 | 42.50 | |
+ + +--------------+---------+-----------------+----------------+-------------------------------------------+
| | | AWQ | 1 | 8.17 | 42.13 | |
+ +--------------+--------------+---------+-----------------+----------------+-------------------------------------------+
| | 14336 | BF16 | 3 | 4.25 | 149.14 | |
+ + +--------------+---------+-----------------+----------------+-------------------------------------------+
| | | GPTQ-Int8 | 2 | 4.66 | 82.55 | auto_gptq==0.6.0+cu1210 |
+ + +--------------+---------+-----------------+----------------+-------------------------------------------+
| | | GPTQ-Int4 | 1 | 5.27 | 46.86 | |
+ + +--------------+---------+-----------------+----------------+-------------------------------------------+
| | | AWQ | 1 | 5.57 | 46.38 | |
+ +--------------+--------------+---------+-----------------+----------------+-------------------------------------------+
| | 30720 | BF16 | 3 | 2.94 | 164.79 | |
+ + +--------------+---------+-----------------+----------------+-------------------------------------------+
| | | GPTQ-Int8 | 2 | 2.94 | 94.75 | auto_gptq==0.6.0+cu1210 |
+ + +--------------+---------+-----------------+----------------+-------------------------------------------+
| | | GPTQ-Int4 | 2 | 3.14 | 62.57 | |
+ + +--------------+---------+-----------------+----------------+-------------------------------------------+
| | | AWQ | 2 | 3.23 | 61.64 | |
+-----------------------------+--------------+--------------+---------+-----------------+----------------+-------------------------------------------+
- 72B (vLLM)
+------------------------------+--------------+--------------+---------+-----------------+-------------------------------------------+
| Model | Input Length | Quantization | GPU Num | Speed(tokens/s) | Note |
+==============================+==============+==============+=========+=================+===========================================+
| Qwen2.5-72B-Instruct | 1 | BF16 | 2 | 18.19 | Setting 1 |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | | BF16 | 4 | 31.37 | Default |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int8 | 2 | 31.40 | Default |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int4 | 1 | 16.47 | Default |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int4 | 2 | 46.30 | Setting 2 |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | | AWQ | 2 | 44.30 | Default |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | 6144 | BF16 | 4 | 29.90 | Default |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int8 | 2 | 29.37 | Default |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int4 | 1 | 13.88 | Default |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int4 | 2 | 42.50 | Setting 3 |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | | AWQ | 2 | 40.67 | Default |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | 14336 | BF16 | 4 | 30.10 | Default |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int8 | 2 | 27.20 | Default |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int4 | 2 | 38.10 | Default |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | | AWQ | 2 | 36.63 | Default |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | 30720 | BF16 | 4 | 27.53 | Default |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int8 | 2 | 23.32 | Default |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int4 | 2 | 30.98 | Default |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | | AWQ | 2 | 30.02 | Default |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | 63488 | BF16 | 4 | 20.74 | Setting 4 |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int8 | 2 | 16.27 | Setting 4 |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int4 | 2 | 19.84 | Setting 4 |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | | AWQ | 2 | 19.32 | Setting 4 |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | 129024 | BF16 | 4 | 12.68 | Setting 5 |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int8 | 4 | 14.11 | Setting 5 |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | | GPTQ-Int4 | 2 | 10.11 | Setting 5 |
+ +--------------+--------------+---------+-----------------+-------------------------------------------+
| | | AWQ | 2 | 9.88 | Setting 5 |
+------------------------------+--------------+--------------+---------+-----------------+-------------------------------------------+
* [Default Setting]=(gpu_memory_utilization=0.9 max_model_len=32768 enforce_eager=False)
* [Setting 1]=(gpu_memory_utilization=0.98 max_model_len=4096 enforce_eager=True)
* [Setting 2]=(gpu_memory_utilization=1.0 max_model_len=4096 enforce_eager=True)
* [Setting 3]=(gpu_memory_utilization=1.0 max_model_len=8192 enforce_eager=True)
* [Setting 4]=(gpu_memory_utilization=0.9 max_model_len=65536 enforce_eager=False)
* [Setting 5]=(gpu_memory_utilization=0.9 max_model_len=131072 enforce_eager=False)
Welcome to Qwen!
================
.. figure:: https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/logo_qwen3.png
:width: 60%
:align: center
:alt: Qwen3
:class: no-scaled-link
Qwen is the large language model and large multimodal model series of the Qwen Team, Alibaba Group. Both language models and multimodal models are pretrained on large-scale multilingual and multimodal data and post-trained on quality data for aligning to human preferences.
Qwen is capable of natural language understanding, text generation, vision understanding, audio understanding, tool use, role play, playing as an AI agent, etc.
The latest version, Qwen3, has the following features:
- **Dense and Mixture-of-Experts (MoE) models**, available in 0.6B, 1.7B, 4B, 8B, 14B, and 32B dense sizes, plus 30B-A3B and 235B-A22B MoE variants.
- **Seamless switching between thinking mode** (for complex logical reasoning, math, and coding) and **non-thinking mode** (for efficient, general-purpose chat) **within a single model**, ensuring optimal performance across various scenarios.
- **Significant enhancement in reasoning capabilities**, surpassing previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
- **Superior human preference alignment**, excelling in creative writing, role-playing, multi-turn dialogues, and instruction following, to deliver a more natural, engaging, and immersive conversational experience.
- **Expertise in agent capabilities**, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models in complex agent-based tasks.
- **Support of 100+ languages and dialects** with strong capabilities for **multilingual instruction following** and **translation**.
For more information, please visit our:
* `Blog <https://qwenlm.github.io/>`__
* `GitHub <https://github.com/QwenLM>`__
* `Hugging Face <https://huggingface.co/Qwen>`__
* `ModelScope <https://modelscope.cn/organization/qwen>`__
* `Qwen3 Collection <https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f>`__
Join our community on `Discord <https://discord.gg/yPEP2vHTu4>`__ and in our `WeChat <https://github.com/QwenLM/Qwen/blob/main/assets/wechat.png>`__ group. We are looking forward to seeing you there!
.. toctree::
:maxdepth: 1
:caption: Getting Started
:hidden:
getting_started/quickstart
getting_started/concepts
getting_started/speed_benchmark
getting_started/quantization_benchmark
.. toctree::
:maxdepth: 1
:caption: Inference
:hidden:
inference/transformers
.. toctree::
:maxdepth: 1
:caption: Run Locally
:hidden:
run_locally/llama.cpp
run_locally/ollama
run_locally/mlx-lm
.. toctree::
:maxdepth: 1
:caption: Deployment
:hidden:
deployment/sglang
deployment/vllm
deployment/tgi
deployment/skypilot
deployment/openllm
.. toctree::
:maxdepth: 1
:caption: Quantization
:hidden:
quantization/awq
quantization/gptq
quantization/llama.cpp
.. toctree::
:maxdepth: 1
:caption: Training
:hidden:
training/llama_factory
training/ms_swift
.. toctree::
:maxdepth: 1
:caption: Framework
:hidden:
framework/function_call
framework/qwen_agent
framework/LlamaIndex
framework/Langchain
# Transformers
Transformers is a library of pretrained models for natural language processing that supports both inference and training.
Developers can use Transformers to train models on their data, build inference applications, and generate text with large language models.
## Environment Setup
- `transformers>=4.51.0`
- `torch>=2.6` is recommended
- GPU is recommended
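For example, assuming a CUDA-enabled environment, the requirements could be installed with pip; the exact command below is illustrative rather than prescriptive (`accelerate` is included because it is used by `device_map="auto"` later in this guide):
```shell
pip install "transformers>=4.51.0" "torch>=2.6" accelerate
```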
## Basic Usage
You can use the `pipeline()` interface or the `generate()` interface to generate text with Qwen3 in Transformers.
In general, the pipeline interface requires less boilerplate code, so it is the one shown here.
The following shows a basic example using the pipeline for multi-turn conversations:
```python
from transformers import pipeline
model_name_or_path = "Qwen/Qwen3-8B"
generator = pipeline(
"text-generation",
model_name_or_path,
torch_dtype="auto",
device_map="auto",
)
messages = [
{"role": "user", "content": "Give me a short introduction to large language model."},
]
messages = generator(messages, max_new_tokens=32768)[0]["generated_text"]
# print(messages[-1]["content"])
messages.append({"role": "user", "content": "In a single sentence."})
messages = generator(messages, max_new_tokens=32768)[0]["generated_text"]
# print(messages[-1]["content"])
```
There are some important parameters when creating the pipeline:
- **Model**: `model_name_or_path` could be a model ID like `Qwen/Qwen3-8B` or a local path.
To download model files to a local directory, you could use
```shell
huggingface-cli download --local-dir ./Qwen3-8B Qwen/Qwen3-8B
```
If you are in mainland China, you can also download model files using ModelScope:
```shell
modelscope download --local_dir ./Qwen3-8B Qwen/Qwen3-8B
```
- **Device Placement**: `device_map="auto"` will automatically load the model parameters onto multiple devices, if available.
It relies on the `accelerate` package.
If you would like to use a single device, you can pass `device` instead of `device_map`.
`device=-1` or `device="cpu"` indicates using CPU, `device="cuda"` indicates using the current GPU, and `device="cuda:1"` or `device=1` indicates using the second GPU.
Do not use `device_map` and `device` at the same time!
- **Compute Precision**: `torch_dtype="auto"` will automatically determine the data type to use, based on the original precision of the checkpoint and the precision supported by your device.
For modern devices, the precision determined will be `bfloat16`.
If you don't pass `torch_dtype="auto"`, the default data type is `float32`, which takes twice the memory and is slower in computation.
Calls to the text generation pipeline will use the generation configuration from the model file, e.g., `generation_config.json`.
That configuration can be overridden by passing arguments directly to the call.
The default is equivalent to
```python
messages = generator(messages, do_sample=True, temperature=0.6, top_k=20, top_p=0.95, eos_token_id=[151645, 151643])[0]["generated_text"]
```
For the best practices in configuring generation parameters, please see the model card.
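For instance, to override the defaults for a single call, you could pass sampling parameters directly; the values below are only illustrative, so consult the model card for the recommended settings:
```python
messages = generator(
    messages,
    max_new_tokens=32768,
    do_sample=True,
    temperature=0.7,
    top_p=0.8,
    top_k=20,
)[0]["generated_text"]
```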
## Thinking & Non-Thinking Mode
By default, Qwen3 models think before responding.
This also applies to the `pipeline()` interface.
To switch between thinking and non-thinking mode, two methods can be used:
- Append a final assistant message, containing only `<think>\n\n</think>\n\n`.
This method is stateless, meaning it will only work for that single turn.
It also strictly prevents the model from generating thinking content.
For example,
```python
messages = [
{"role": "user", "content": "Give me a short introduction to large language model."},
{"role": "assistant", "content": "<think>\n\n</think>\n\n"},
]
messages = generator(messages, max_new_tokens=32768)[0]["generated_text"]
# print(messages[-1]["content"])
messages.append({"role": "user", "content": "In a single sentence."})
messages = generator(messages, max_new_tokens=32768)[0]["generated_text"]
# print(messages[-1]["content"])
```
- Add `/no_think` to the user (or system) message to disable thinking, or `/think` to enable it.
This method is stateful, meaning the model will follow the most recent instruction in multi-turn conversations.
You can also use instructions in natural language.
```python
messages = [
{"role": "user", "content": "Give me a short introduction to large language model./no_think"},
]
messages = generator(messages, max_new_tokens=32768)[0]["generated_text"]
# print(messages[-1]["content"])
messages.append({"role": "user", "content": "In a single sentence./think"})
messages = generator(messages, max_new_tokens=32768)[0]["generated_text"]
# print(messages[-1]["content"])
```
## Parsing Thinking Content
If you would like a more structured assistant message format, you can use the following function to extract the thinking content into a field named `reasoning_content`, which is similar to the format used by vLLM, SGLang, etc.
```python
import copy
import re
def parse_thinking_content(messages):
    messages = copy.deepcopy(messages)
    for message in messages:
        if message["role"] == "assistant" and (m := re.match(r"<think>\n(.+)</think>\n\n", message["content"], flags=re.DOTALL)):
            message["content"] = message["content"][len(m.group(0)):]
            if thinking_content := m.group(1).strip():
                message["reasoning_content"] = thinking_content
    return messages
```
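As a usage sketch, assuming the `generator` and `messages` from the earlier examples, the parsed messages could then be obtained as follows:
```python
messages = generator(messages, max_new_tokens=32768)[0]["generated_text"]
parsed = parse_thinking_content(messages)
# Assistant turns now have "content" without the <think> block and,
# when thinking content was generated, a separate "reasoning_content" field.
print(parsed[-1].get("reasoning_content", ""))
print(parsed[-1]["content"])
```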
## Parsing Tool Calls
For tool calling with Transformers, please refer to [our guide on Function Calling](../framework/function_call.md#hugging-face-transformers).
## Serving Quantized Models
Qwen3 comes with two types of pre-quantized models, FP8 and AWQ.
The commands for serving these models are the same as for the original models, except for the model name:
```python
from transformers import pipeline
model_name_or_path = "Qwen/Qwen3-8B-FP8" # FP8 models
# model_name_or_path = "Qwen/Qwen3-8B-AWQ" # AWQ models
generator = pipeline(
"text-generation",
model_name_or_path,
torch_dtype="auto",
device_map="auto",
)
```
:::{note}
FP8 computation is supported on NVIDIA GPUs with compute capability 8.9 or higher, that is, Ada Lovelace, Hopper, and later GPUs.
For better performance, make sure `triton` and a CUDA compiler compatible with the CUDA version of `torch` in your environment are installed.
:::
:::{important}
As of 4.51.0, there are issues with Transformers when running those checkpoints **across GPUs**.
The following methods could be used to work around those issues:
- Set the environment variable `CUDA_LAUNCH_BLOCKING=1` before running the script; or
- Uncomment [this line](https://github.com/huggingface/transformers/blob/0720e206c6ba28887e4d60ef60a6a089f6c1cc76/src/transformers/integrations/finegrained_fp8.py#L340) in your local installation of `transformers`.
:::
## Enabling Long Context
The maximum context length in pre-training for Qwen3 models is 32,768 tokens.
It can be extended to 131,072 tokens with RoPE scaling techniques.
We have validated the performance with YaRN.
Transformers supports YaRN, which can be enabled either by modifying the model files or overriding the default arguments when loading the model.
- Modifying the model files: in the `config.json` file, add the `rope_scaling` fields:
```json
{
...,
"rope_scaling": {
"type": "yarn",
"factor": 4.0,
"original_max_position_embeddings": 32768
}
}
```
- Overriding the default arguments:
```python
from transformers import pipeline
model_name_or_path = "Qwen/Qwen3-8B"
generator = pipeline(
"text-generation",
model_name_or_path,
torch_dtype="auto",
device_map="auto",
model_kwargs={
"rope_scaling": {
"type": "yarn",
"factor": 4.0,
"original_max_position_embeddings": 32768
}
}
)
```
:::{note}
Transformers implements static YaRN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts.**
We advise adding the `rope_scaling` configuration only when processing long contexts is required.
It is also recommended to modify the `factor` as needed. For example, if the typical context length for your application is 65,536 tokens, it would be better to set `factor` as 2.0.
:::
## Streaming Generation
With the help of `TextStreamer`, you can switch your chat with Qwen3 to streaming mode.
The response will be printed to the console or terminal as it is being generated.
```python
from transformers import pipeline, TextStreamer
model_name_or_path = "Qwen/Qwen3-8B"
generator = pipeline(
"text-generation",
model_name_or_path,
torch_dtype="auto",
device_map="auto",
)
streamer = TextStreamer(generator.tokenizer, skip_prompt=True, skip_special_tokens=True)
messages = [{"role": "user", "content": "Give me a short introduction to large language model."}]
messages = generator(messages, max_new_tokens=32768, streamer=streamer)[0]["generated_text"]
```
Besides using `TextStreamer`, we can also use `TextIteratorStreamer` which stores print-ready text in a queue, to be used by a downstream application as an iterator:
```python
from transformers import pipeline, TextIteratorStreamer
model_name_or_path = "Qwen/Qwen3-8B"
generator = pipeline(
"text-generation",
model_name_or_path,
torch_dtype="auto",
device_map="auto",
)
streamer = TextIteratorStreamer(generator.tokenizer, skip_prompt=True, skip_special_tokens=True)
# Use Thread to run generation in background
# Otherwise, the process is blocked until generation is complete
# and no streaming effect can be observed.
from threading import Thread
messages = [{"role": "user", "content": "Give me a short introduction to large language model."}]
generation_kwargs = dict(text_inputs=messages, max_new_tokens=32768, streamer=streamer)
thread = Thread(target=generator, kwargs=generation_kwargs)
thread.start()
generated_text = ""
for new_text in streamer:
    generated_text += new_text
    print(generated_text)
```
## Batch Generation
:::{note}
Batching is not automatically a win for performance.
:::
```python
from transformers import pipeline
model_name_or_path = "Qwen/Qwen3-8B"
generator = pipeline(
"text-generation",
model_name_or_path,
torch_dtype="auto",
device_map="auto",
)
generator.tokenizer.padding_side="left"
batch = [
[{"role": "user", "content": "Give me a short introduction to large language model."}],
[{"role": "user", "content": "Give me a detailed introduction to large language model."}],
]
results = generator(batch, max_new_tokens=32768, batch_size=2)
batch = [result[0]["generated_text"] for result in results]
```
## FAQ
You may find that distributed inference with Transformers is not as fast as you would expect.
Transformers with `device_map="auto"` does not apply tensor parallelism; it places different layers on different GPUs, so only one GPU is active at a time.
For Transformers with tensor parallelism, please refer to [its documentation](https://huggingface.co/docs/transformers/v4.51.3/en/perf_infer_gpu_multi); a hedged sketch is shown below.
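The following is a rough sketch, based on that documentation, of what tensor-parallel inference could look like; the script name, GPU count, and generation settings are placeholders, so please check the linked page for the authoritative example:
```python
# Sketch only: save as tp_inference.py and run with, e.g.,
#   torchrun --nproc-per-node 4 tp_inference.py
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name_or_path = "Qwen/Qwen3-8B"
device = torch.device(f"cuda:{int(os.environ.get('LOCAL_RANK', 0))}")

# tp_plan="auto" shards the weights across the processes launched by torchrun
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    torch_dtype="auto",
    tp_plan="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

messages = [{"role": "user", "content": "Give me a short introduction to large language model."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(device)
output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```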
# AWQ
:::{attention}
To be updated for Qwen3.
:::
For quantized models, one of our recommendations is the usage of [AWQ](https://arxiv.org/abs/2306.00978) with [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
**AWQ** refers to Activation-aware Weight Quantization, a hardware-friendly approach for LLM low-bit weight-only quantization.
**AutoAWQ** is an easy-to-use Python library for 4-bit quantized models that implements the Activation-aware Weight Quantization (AWQ) algorithm.
Compared to FP16, AutoAWQ speeds up models by 3x and reduces memory requirements by 3x.
In this document, we show you how to use the quantized model with Hugging Face `transformers` and also how to quantize your own model.
## Usage of AWQ Models with Hugging Face transformers
Now, `transformers` has officially supported AutoAWQ, which means that you can directly use the quantized model with `transformers`.
The following is a very simple code snippet showing how to run the quantized model `Qwen2.5-7B-Instruct-AWQ`:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Qwen/Qwen2.5-7B-Instruct-AWQ"
model = AutoModelForCausalLM.from_pretrained(
model_name,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt = "Give me a short introduction to large language model."
messages = [
{"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
{"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
**model_inputs,
max_new_tokens=512,
)
generated_ids = [
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
## Usage of AWQ Models with vLLM
vLLM has supported AWQ, which means that you can directly use our provided AWQ models or those quantized with `AutoAWQ` with vLLM.
We recommend using the latest version of vLLM (`vllm>=0.6.1`) which brings performance improvements to AWQ models; otherwise, the performance might not be well-optimized.
Actually, the usage is the same as the basic usage of vLLM.
We provide a simple example of how to launch an OpenAI-compatible API with vLLM and `Qwen2.5-7B-Instruct-AWQ`.
Run the following in a shell to start the API service:
```bash
vllm serve Qwen/Qwen2.5-7B-Instruct-AWQ
```
Then, you can call the API as
```bash
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen2.5-7B-Instruct-AWQ",
"messages": [
{"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
{"role": "user", "content": "Tell me something about large language models."}
],
"temperature": 0.7,
"top_p": 0.8,
"repetition_penalty": 1.05,
"max_tokens": 512
}'
```
or you can use the API client with the `openai` Python package as shown below:
```python
from openai import OpenAI
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
chat_response = client.chat.completions.create(
model="Qwen/Qwen2.5-7B-Instruct-AWQ",
messages=[
{"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
{"role": "user", "content": "Tell me something about large language models."},
],
temperature=0.7,
top_p=0.8,
max_tokens=512,
extra_body={
"repetition_penalty": 1.05,
},
)
print("Chat response:", chat_response)
```
## Quantize Your Own Model with AutoAWQ
If you want to quantize your own model to AWQ, we advise you to use AutoAWQ.
```bash
pip install "autoawq<0.2.7"
```
Suppose you have finetuned a model based on `Qwen2.5-7B`, which is named `Qwen2.5-7B-finetuned`, with your own dataset, e.g., Alpaca.
To build your own AWQ quantized model, you need to use the training data for calibration.
Below, we provide a simple demonstration for you to run:
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
# Specify paths and hyperparameters for quantization
model_path = "your_model_path"
quant_path = "your_quantized_model_path"
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
# Load your tokenizer and model with AutoAWQ
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="auto", safetensors=True)
```
Then you need to prepare your data for calibration.
All you need to do is put the samples into a list, each element of which is a text.
As we directly use our finetuning data for calibration, we first format it with the ChatML template.
For example,
```python
data = []
for msg in dataset:
    text = tokenizer.apply_chat_template(msg, tokenize=False, add_generation_prompt=False)
    data.append(text.strip())
```
where each `msg` is a typical chat message as shown below:
```json
[
{"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
{"role": "user", "content": "Tell me who you are."},
{"role": "assistant", "content": "I am a large language model named Qwen..."}
]
```
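For illustration only, a tiny in-memory `dataset` could look like the hypothetical example below; in practice you would load your actual finetuning data instead:
```python
# Hypothetical in-memory dataset; replace it with your real finetuning data.
dataset = [
    [
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": "Tell me who you are."},
        {"role": "assistant", "content": "I am a large language model named Qwen..."},
    ],
    # ... more conversations ...
]
```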
Then just run the calibration process with one line of code:
```python
model.quantize(tokenizer, quant_config=quant_config, calib_data=data)
```
Finally, save the quantized model:
```python
model.save_quantized(quant_path, safetensors=True, shard_size="4GB")
tokenizer.save_pretrained(quant_path)
```
Then you can obtain your own AWQ quantized model for deployment.
Enjoy!
# GPTQ
:::{attention}
To be updated for Qwen3.
:::
[GPTQ](https://arxiv.org/abs/2210.17323) is a quantization method for GPT-like LLMs, which uses one-shot weight quantization based on approximate second-order information.
In this document, we show you how to use the quantized model with Hugging Face `transformers` and also how to quantize your own model with [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ).
## Usage of GPTQ Models with Hugging Face transformers
:::{note}
To use the official Qwen2.5 GPTQ models with `transformers`, please ensure that `optimum>=1.20.0` and compatible versions of `transformers` and `auto_gptq` are installed.
You can do that by
```bash
pip install -U "optimum>=1.20.0"
```
:::
Now, `transformers` has officially supported AutoGPTQ, which means that you can directly use the quantized model with `transformers`.
For each size of Qwen2.5, we provide both Int4 and Int8 GPTQ quantized models.
The following is a very simple code snippet showing how to run `Qwen2.5-7B-Instruct-GPTQ-Int4`:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4"
model = AutoModelForCausalLM.from_pretrained(
model_name,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt = "Give me a short introduction to large language model."
messages = [
{"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
{"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
**model_inputs,
max_new_tokens=512,
)
generated_ids = [
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
## Usage of GPTQ Models with vLLM
vLLM has supported GPTQ, which means that you can directly use our provided GPTQ models or those quantized with `AutoGPTQ` with vLLM.
If possible, it will automatically use the GPTQ Marlin kernel, which is more efficient.
Actually, the usage is the same as the basic usage of vLLM.
We provide a simple example of how to launch an OpenAI-compatible API with vLLM and `Qwen2.5-7B-Instruct-GPTQ-Int4`.
Run the following in a shell to start the API service:
```bash
vllm serve Qwen2.5-7B-Instruct-GPTQ-Int4
```
Then, you can call the API as
```bash
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Qwen2.5-7B-Instruct-GPTQ-Int4",
"messages": [
{"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
{"role": "user", "content": "Tell me something about large language models."}
],
"temperature": 0.7,
"top_p": 0.8,
"repetition_penalty": 1.05,
"max_tokens": 512
}'
```
or you can use the API client with the `openai` Python package as shown below:
```python
from openai import OpenAI
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
chat_response = client.chat.completions.create(
model="Qwen2.5-7B-Instruct-GPTQ-Int4",
messages=[
{"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
{"role": "user", "content": "Tell me something about large language models."},
],
temperature=0.7,
top_p=0.8,
max_tokens=512,
extra_body={
"repetition_penalty": 1.05,
},
)
print("Chat response:", chat_response)
```
## Quantize Your Own Model with AutoGPTQ
If you want to quantize your own model to GPTQ, we advise you to use AutoGPTQ.
We suggest installing the latest version of the package from source:
```bash
git clone https://github.com/AutoGPTQ/AutoGPTQ
cd AutoGPTQ
pip install -e .
```
Suppose you have finetuned a model based on `Qwen2.5-7B`, which is named `Qwen2.5-7B-finetuned`, with your own dataset, e.g., Alpaca.
To build your own GPTQ quantized model, you need to use the training data for calibration.
Below, we provide a simple demonstration for you to run:
```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
# Specify paths and hyperparameters for quantization
model_path = "your_model_path"
quant_path = "your_quantized_model_path"
quantize_config = BaseQuantizeConfig(
bits=8, # 4 or 8
group_size=128,
damp_percent=0.01,
desc_act=False, # setting to False can significantly speed up inference but the perplexity may be slightly worse
static_groups=False,
sym=True,
true_sequential=True,
model_name_or_path=None,
model_file_base_name="model"
)
max_len = 8192
# Load your tokenizer and model with AutoGPTQ
# To learn about loading model to multiple GPUs,
# visit https://github.com/AutoGPTQ/AutoGPTQ/blob/main/docs/tutorial/02-Advanced-Model-Loading-and-Best-Practice.md
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoGPTQForCausalLM.from_pretrained(model_path, quantize_config)
```
However, if you would like to load the model on multiple GPUs, you need to use `max_memory` instead of `device_map`.
Here is an example:
```python
model = AutoGPTQForCausalLM.from_pretrained(
model_path,
quantize_config,
max_memory={i: "20GB" for i in range(4)}
)
```
Then you need to prepare your data for calibration.
All you need to do is put the samples into a list, each element of which is a text.
As we directly use our finetuning data for calibration, we first format it with the ChatML template.
For example,
```python
import torch
data = []
for msg in dataset:
    text = tokenizer.apply_chat_template(msg, tokenize=False, add_generation_prompt=False)
    model_inputs = tokenizer([text])
    input_ids = torch.tensor(model_inputs.input_ids[:max_len], dtype=torch.int)
    data.append(dict(input_ids=input_ids, attention_mask=input_ids.ne(tokenizer.pad_token_id)))
```
where each `msg` is a typical chat message as shown below:
```json
[
{"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
{"role": "user", "content": "Tell me who you are."},
{"role": "assistant", "content": "I am a large language model named Qwen..."}
]
```
Then just run the calibration process with one line of code:
```python
import logging
logging.basicConfig(
format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)
model.quantize(data, cache_examples_on_gpu=False)
```
Finally, save the quantized model:
```python
model.save_quantized(quant_path, use_safetensors=True)
tokenizer.save_pretrained(quant_path)
```
Unfortunately, the `save_quantized` method does not support sharding.
For sharding, you need to load the quantized model and use `save_pretrained` from `transformers` to save and shard it, as sketched below.
Other than that, everything is straightforward.
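A minimal sketch of that reload-and-reshard step, assuming the quantized checkpoint from above, a hypothetical output directory, and an installed GPTQ runtime (e.g., `optimum` and `auto_gptq`), might look like this:
```python
from transformers import AutoModelForCausalLM

quant_path = "your_quantized_model_path"    # produced by save_quantized above
sharded_path = "your_sharded_model_path"    # hypothetical output directory

# Reload the quantized checkpoint with transformers, then re-save it in shards
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
model.save_pretrained(sharded_path, max_shard_size="4GB", safe_serialization=True)
```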
Enjoy!
## Known Issues
### Qwen2.5-72B-Instruct-GPTQ-Int4 cannot stop generation properly
:Model: Qwen2.5-72B-Instruct-GPTQ-Int4
:Framework: vLLM, AutoGPTQ (including Hugging Face transformers)
:Description: Generation cannot stop properly. The model keeps generating after the point where it should stop, and then repeated text is produced, whether a single character, a phrase, or whole paragraphs.
:Workaround: The following workarounds could be considered:
1. Using the original model in 16-bit floating point
2. Using the AWQ variants or llama.cpp-based models for reduced chances of abnormal generation
### Qwen2.5-32B-Instruct-GPTQ-Int4 broken with vLLM on multiple GPUs
:Model: Qwen2.5-32B-Instruct-GPTQ-Int4
:Framework: vLLM
:Description: When deployed on multiple GPUs, the model generates only garbled text like `!!!!!!!!!!!!!!!!!!`.
:Workaround: Each of the following workarounds could be considered:
1. Using the AWQ or GPTQ-Int8 variants
2. Using a single GPU
3. Using Hugging Face `transformers` if latency and throughput are not major concerns
## Troubleshooting
:::{dropdown} With `transformers` and `auto_gptq`, the logs suggest `CUDA extension not installed.` and the inference is slow.
`auto_gptq` fails to find a fused CUDA kernel compatible with your environment and falls back to a plain implementation.
Follow its [installation guide](https://github.com/AutoGPTQ/AutoGPTQ/blob/main/docs/INSTALLATION.md) to install a pre-built wheel or try installing `auto_gptq` from source.
:::
:::{dropdown} With a self-quantized Qwen2.5-72B-Instruct-GPTQ model and `vllm`, `ValueError: ... must be divisible by ...` is raised. The intermediate size of the self-quantized model is different from that of the official Qwen2.5-72B-Instruct-GPTQ models.
After quantization, the sizes of the quantized weights are divided by the group size, which is typically 128.
The intermediate size for the FFN blocks in Qwen2.5-72B is 29568.
Unfortunately, {math}`29568 \div 128 = 231`, which is an odd number.
Since the number of attention heads and the dimensions of the weights must be divisible by the tensor parallel size, this means you can only run the quantized model with `tensor_parallel_size=1`, i.e., on one GPU card.
A workaround is to make the intermediate size divisible by {math}`128 \times 8 = 1024`.
To achieve that, the weights should be padded with zeros.
While it is mathematically equivalent before and after zero-padding the weights, the results may be slightly different in reality.
Try the following:
```python
import torch
from torch.nn import functional as F
from transformers import AutoModelForCausalLM
# must use AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-72B-Instruct", torch_dtype="auto")
# this size is Qwen2.5-72B only
pad_size = 128
sd = model.state_dict()
for i, k in enumerate(sd):
    v = sd[k]
    print(k, i)
    # interleaving the padded zeros
    if ('mlp.up_proj.weight' in k) or ('mlp.gate_proj.weight' in k):
        prev_v = F.pad(v.unsqueeze(1), (0, 0, 0, 1, 0, 0)).reshape(29568*2, -1)[:pad_size*2]
        new_v = torch.cat([prev_v, v[pad_size:]], dim=0)
        sd[k] = new_v
    elif 'mlp.down_proj.weight' in k:
        prev_v = F.pad(v.unsqueeze(2), (0, 1)).reshape(8192, 29568*2)[:, :pad_size*2]
        new_v = torch.cat([prev_v, v[:, pad_size:]], dim=1)
        sd[k] = new_v
# this is a very large file; make sure your RAM is enough to load the model
torch.save(sd, '/path/to/padded_model/pytorch_model.bin')
```
This will save the padded checkpoint to the specified directory.
Then, copy other files from the original checkpoint to the new directory and modify the `intermediate_size` in `config.json` to `29696`.
Finally, you can quantize the saved model checkpoint.
:::
# llama.cpp
Quantization is a major topic for local inference of LLMs, as it reduces the memory footprint.
Undoubtedly, llama.cpp natively supports LLM quantization, with its usual flexibility.
At a high level, all quantization supported by llama.cpp is weight-only quantization:
model parameters are quantized into lower bits, and during inference they are dequantized and used in computation.
In addition, you can mix different quantization data types in a single quantized model, e.g., you can quantize the embedding weights using one quantization data type and the other weights using a different one.
With an adequate mixture of quantization types, much lower quantization error can be attained with just a slight increase in bits per weight.
The example program `llama-quantize` supports many quantization presets, such as Q4_K_M and Q8_0.
If you find the quantization errors still more than expected, you can bring your own scales, e.g., as computed by AWQ, or use calibration data to compute an importance matrix using `llama-imatrix`, which can then be used during quantization to enhance the quality of the quantized models.
In this document, we demonstrate the common way to quantize your model and evaluate the performance of the quantized model.
We will assume you have the example programs from llama.cpp at hand.
If you don't, check our guide [here](../run_locally/llama.cpp.html#getting-the-program){.external}.
## Getting the GGUF
Now, suppose you would like to quantize `Qwen3-8B`.
You need to first make a GGUF file as shown below:
```bash
python convert-hf-to-gguf.py Qwen/Qwen3-8B --outfile qwen3-8b-f16.gguf
```
Sometimes, it may be better to use fp32 as the starting point for quantization.
In that case, use
```bash
python convert-hf-to-gguf.py Qwen/Qwen3-8B --outtype f32 --outfile qwen3-8b-f32.gguf
```
## Quantizing the GGUF without Calibration
In the simplest case, you can directly quantize the model to lower bits based on your requirements.
An example of quantizing the model to 8 bits is shown below:
```bash
./llama-quantize qwen3-8b-f16.gguf qwen3-8b-q8_0.gguf Q8_0
```
`Q8_0` is a code for a quantization preset.
You can find all the presets in [the source code of `llama-quantize`](https://github.com/ggml-org/llama.cpp/blob/master/examples/quantize/quantize.cpp).
Look for the variable `QUANT_OPTIONS`.
Common ones used for 7B models include `Q8_0`, `Q5_0`, and `Q4_K_M`.
The letter case doesn't matter, so `q8_0` or `q4_K_m` are perfectly fine.
Now you can use the GGUF file of the quantized model with applications based on llama.cpp.
Very simple indeed.
However, occasionally the accuracy of the quantized model could be lower than expected, especially for lower-bit quantization.
The program may even prevent you from doing so for some very low-bit types.
There are several ways to improve the quality of quantized models.
A common way is to use a calibration dataset in the target domain to identify the weights that really matter and quantize the model in a way that those weights have lower quantization errors, as introduced in the next two methods.
## Quantizing the GGUF with AWQ Scale
:::{attention}
To be updated for Qwen3.
:::
To improve the quality of your quantized models, one possible solution is to apply the AWQ scale, following [this script](https://github.com/casper-hansen/AutoAWQ/blob/main/docs/examples.md#gguf-export).
First, when you run `model.quantize()` with `autoawq`, remember to add `export_compatible=True` as shown below:
```python
...
model.quantize(
tokenizer,
quant_config=quant_config,
export_compatible=True
)
model.save_pretrained(quant_path)
...
```
The above code will not actually quantize the weights.
Instead, it adjusts weights based on a dataset so that they are "easier" to quantize.[^AWQ]
Then, when you run `convert-hf-to-gguf.py`, remember to replace the model path with the path to the new model:
```bash
python convert-hf-to-gguf.py <quant_path> --outfile qwen2.5-7b-instruct-f16-awq.gguf
```
Finally, you can quantize the model as in the last example:
```bash
./llama-quantize qwen2.5-7b-instruct-f16-awq.gguf qwen2.5-7b-instruct-q8_0.gguf Q8_0
```
In this way, it should be possible to achieve similar quality with fewer bits per weight.
[^AWQ]: If you are interested in what this means, refer to [the AWQ paper](https://arxiv.org/abs/2306.00978).
Basically, important weights (called salient weights in the paper) are identified based on activations across data examples.
The weights are scaled accordingly such that the salient weights are protected even after quantization.
## Quantizing the GGUF with Importance Matrix
Another possible solution is to use the "importance matrix"[^imatrix], following [this](https://github.com/ggml-org/llama.cpp/tree/master/examples/imatrix).
First, you need to compute the importance matrix data of the weights of a model (`-m`) using a calibration dataset (`-f`):
```bash
./llama-imatrix -m qwen3-8b-f16.gguf -f calibration-text.txt --chunk 512 -o qwen3-8b-imatrix.dat -ngl 80
```
The text is cut in chunks of length `--chunk` for computation.
Preferably, the text should be representative of the target domain.
The final results will be saved in a file named `qwen3-8b-imatrix.dat` (`-o`), which can then be used:
```bash
./llama-quantize --imatrix qwen3-8b-imatrix.dat \
qwen3-8b-f16.gguf qwen3-8b-q4_k_m.gguf Q4_K_M
```
For lower-bit quantization mixtures, such as 1-bit or 2-bit ones, `llama-quantize` will print a helpful warning if you do not provide `--imatrix`.
[^imatrix]: Here, the importance matrix keeps a record of how weights affect the output: a weight is considered important if a slight change in its value causes a huge difference in the results, akin to the [GPTQ](https://arxiv.org/abs/2210.17323) algorithm.
## Perplexity Evaluation
`llama.cpp` provides an example program for us to calculate the perplexity, which evaluates how unlikely the given text is to the model.
It should be mostly used for comparisons: the lower the perplexity, the better the model remembers the given text.
To do this, you need to prepare a dataset, say "wiki test"[^wiki].
You can download the dataset with:
```bash
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip?ref=salesforce-research -O wikitext-2-raw-v1.zip
unzip wikitext-2-raw-v1.zip
```
Then you can run the test with the following command:
```bash
./llama-perplexity -m qwen3-8b-q8_0.gguf -f wiki.test.raw -ngl 80
```
Wait for some time and you will get the perplexity of the model.
There are some numbers for different kinds of quantization mixtures [here](https://github.com/ggml-org/llama.cpp/blob/master/examples/perplexity/README.md).
It might be helpful to look at the differences and get a sense of how each kind of quantization might perform.
[^wiki]: It is not a good evaluation dataset for instruct models, but it is very common and easily accessible.
You probably want to use a dataset similar to your target domain.
## Finally
In this guide, we demonstrate how to conduct quantization and evaluate the perplexity with llama.cpp.
For more information, please visit the [llama.cpp GitHub repo](https://github.com/ggml-org/llama.cpp).
We usually quantize the fp16 model to 4-, 5-, 6-, and 8-bit models with different quantization mixtures, but sometimes a particular mixture just does not work, so we don't provide those on our Hugging Face Hub.
However, others in the community may have success, so if you haven't found what you need in our repos, look around.
Enjoy your freshly quantized models!