Before starting, let's first discuss what llama.cpp is, what you should expect, and why we say "use" llama.cpp, with "use" in quotes.
llama.cpp is essentially a different ecosystem with a different design philosophy, targeting a lightweight footprint, minimal external dependencies, multi-platform support, and extensive, flexible hardware support:
- Plain C/C++ implementation without external dependencies
- Supports a wide variety of hardware:
  - AVX, AVX2, and AVX512 support for x86_64 CPUs
  - Apple Silicon via Metal and Accelerate (CPU and GPU)
- Various quantization schemes for faster inference and reduced memory footprint
- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity
It's like the Python frameworks `torch`+`transformers` or `torch`+`vllm`, but in C++.
However, the difference in language is crucial:
- Python is an interpreted language:
The code you write is executed line-by-line on-the-fly by an interpreter.
You can run example code snippets or scripts with the interpreter or in an interactive interpreter shell.
In addition, Python is learner-friendly; even without much prior knowledge, you can tweak the source code here and there.
- C++ is a compiled language:
The source code you write needs to be compiled beforehand: a compiler translates it into machine code and produces an executable program.
The overhead from the language side is minimal.
You do have source code for example programs showcasing how to use the library.
But it is not very easy to modify the source code if you are not versed in C++ or C.
To use llama.cpp means using the llama.cpp library in your own program, as the authors of [Ollama](https://ollama.com/), [LM Studio](https://lmstudio.ai/), [GPT4ALL](https://www.nomic.ai/gpt4all), [llamafile](https://llamafile.ai/), etc. do.
But that's not what this guide intends or is able to do.
Instead, here we introduce how to use the `llama-cli` example program, so that you know llama.cpp does support Qwen3 models and get a sense of how the llama.cpp ecosystem generally works.
:::
In this guide, we will show how to "use" [llama.cpp](https://github.com/ggml-org/llama.cpp) to run models on your local machine, in particular the `llama-cli` and `llama-server` example programs, which come with the library.
The main steps are:
1. Get the programs
2. Get the Qwen3 models in GGUF[^GGUF] format
3. Run the program with the model
:::{note}
llama.cpp supports Qwen3 and Qwen3MoE from version `b5092`.
:::
## Getting the Program
You can get the programs in various ways.
For optimal efficiency, we recommend compiling the programs locally, so you get the CPU optimizations for free.
However, if you don't have C++ compilers locally, you can also install via package managers or download pre-built binaries.
These may be less efficient, but for non-production example use, they are fine.
:::::{tab-set}
::::{tab-item} Compile Locally
Here, we show the basic command to compile `llama-cli` locally on **macOS** or **Linux**.
For Windows or GPU users, please refer to [the guide from llama.cpp](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md).
:::{rubric} Installing Build Tools
:heading-level: 5
:::
To build locally, a C++ compiler and a build system tool are required.
To see if they have been installed already, type `cc --version` or `cmake --version` in a terminal window.
- If installed, the version information of the tool will be printed to the terminal, and you are good to go!
- If errors are raised, you need to first install the related tools:
- On macOS, install with the command `xcode-select --install`
- On Ubuntu, install with the command `sudo apt install build-essential`.
For other Linux distributions, the command may vary; the essential packages needed for this guide are `gcc` and `cmake`.
:::{rubric} Compiling the Program
:heading-level: 5
:::
For the first step, clone the repo and enter the directory:
```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
```
Then, build llama.cpp using CMake:
```bash
cmake -B build
cmake --build build --config Release
```
The first command will check the local environment and determine which backends and features should be included.
The second command will actually build the programs.
To shorten the time, you can also enable parallel compiling based on the CPU cores you have, for example:
```bash
cmake --build build --config Release -j 8
```
This will build the programs with 8 parallel compiling jobs.
The built programs will be in `./build/bin/`.
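If you have a CUDA-capable GPU and the CUDA toolkit installed, a GPU-enabled build can be produced by enabling the corresponding backend at configure time. This is a minimal sketch; see the llama.cpp build guide linked above for Metal, Vulkan, SYCL, and other backends:
```bash
# Configure with the CUDA backend enabled, then build as before
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j 8
```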
::::
::::{tab-item} Package Managers
For **macOS** and **Linux** users, `llama-cli` and `llama-server` can be installed with package managers including Homebrew, Nix, and Flox.
Here, we show how to install `llama-cli` and `llama-server` with Homebrew.
For other package managers, please check the instructions [here](https://github.com/ggml-org/llama.cpp/blob/master/docs/install.md).
Installing with Homebrew is very simple:
1. Ensure that Homebrew is available on your operating system.
If you don't have Homebrew, you can install it following the instructions on [its website](https://brew.sh/).
2. Install the pre-built binaries, which include `llama-cli` and `llama-server`, with a single command:
```bash
brew install llama.cpp
```
Note that the installed binaries might not be built with the optimal compile options for your hardware, which can lead to poor performance.
They also don't support GPU on Linux systems.
::::
::::{tab-item} Binary Release
You can also download pre-built binaries from [GitHub Releases](https://github.com/ggml-org/llama.cpp/releases).
Please note that the pre-built binary files are architecture-, backend-, and OS-specific.
If you are not sure what those mean, you probably don't want to use them; running incompatible versions will most likely fail or lead to poor performance.
The file name is like `llama-<version>-bin-<os>-<feature>-<arch>.zip`.
There are three simple parts:
- `<version>`: the version of llama.cpp. The latest is preferred, but as llama.cpp is updated and released frequently, the latest may contain bugs. If the latest version does not work, try the previous release until it works.
- `<os>`: the operating system. `win` for Windows; `macos` for macOS; `linux` for Linux.
- `<arch>`: the system architecture. `x64` for `x86_64`, e.g., most Intel and AMD systems, including Intel Mac; `arm64` for `arm64`, e.g., Apple Silicon or Snapdragon-based systems.
The `<feature>` part is somewhat complicated for Windows:
- Running on CPU
  - x86_64 CPUs: We suggest trying the `avx2` one first.
    - `noavx`: No hardware acceleration at all.
    - `avx2`, `avx`, `avx512`: SIMD-based acceleration. Most modern desktop CPUs should support `avx2`, and some CPUs support `avx512`.
    - `openblas`: Relies on OpenBLAS to accelerate prompt processing but not generation.
  - arm64 CPUs: We suggest trying the `llvm` one first.
    - [`llvm` and `msvc`](https://github.com/ggml-org/llama.cpp/pull/7191) are different compilers.
- Running on GPU: We suggest trying the `cu<cuda_version>` one for NVIDIA GPUs, `kompute` for AMD GPUs, and `sycl` for Intel GPUs first. Ensure that you have the related drivers installed.
  - [`vulkan`](https://github.com/ggml-org/llama.cpp/pull/2059): supports certain NVIDIA and AMD GPUs
  - [`kompute`](https://github.com/ggml-org/llama.cpp/pull/4456): supports certain NVIDIA and AMD GPUs
  - [`sycl`](https://github.com/ggml-org/llama.cpp/discussions/5138): Intel GPUs; the oneAPI runtime is included
  - `cu<cuda_version>`: NVIDIA GPUs; the CUDA runtime is not included. You can download `cudart-llama-bin-win-cu<cuda_version>-x64.zip` and unzip it into the same directory if you don't have the corresponding CUDA toolkit installed.
You don't have much choice for macOS or Linux.
- Linux: only one prebuilt binary, `llama-<version>-bin-linux-x64.zip`, supporting CPU.
- macOS: `llama-<version>-bin-macos-x64.zip` for Intel Mac with no GPU support; `llama-<version>-bin-macos-arm64.zip` for Apple Silicon with GPU support.
After downloading the `.zip` file, unzip it into a directory and open a terminal at that directory.
::::
:::::
## Getting the GGUF
GGUF[^GGUF] is a file format for storing information needed to run a model, including but not limited to model weights, model hyperparameters, default generation configuration, and tokenizer.
You can use the official Qwen GGUFs from our HuggingFace Hub or prepare your own GGUF file.
### Using the Official Qwen3 GGUFs
We provide a series of GGUF models in our Hugging Face organization; to find what you need, search for repo names containing `-GGUF`.
Download the GGUF model that you want with `huggingface-cli` (you need to install it first with `pip install huggingface_hub`), as shown below.
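For example, the following downloads the Q8_0 file of Qwen3-8B to the current directory; the exact filename is an assumption here, so check the file list of the repo you choose for the quantization you want:
```bash
huggingface-cli download Qwen/Qwen3-8B-GGUF Qwen3-8B-Q8_0.gguf --local-dir .
```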
### Preparing Your Own GGUF

You can also convert a Hugging Face model to GGUF yourself with a conversion script, as sketched below.
The first argument to the script refers to the path to the HF model directory or the HF model name, and the second argument refers to the path of your output GGUF file.
Remember to create the output directory before you run the command.
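A minimal sketch of the conversion, assuming the `convert_hf_to_gguf.py` script shipped in the llama.cpp repository (the script name and exact options may vary across llama.cpp versions; here the output path is given via `--outfile`):
```bash
# Convert a local HF checkpoint directory to an fp16 GGUF file
python convert_hf_to_gguf.py ./Qwen3-8B --outtype f16 --outfile models/qwen3-8b-f16.gguf
```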
The fp16 model could be a bit heavy for running locally, and you can quantize the model as needed.
We introduce the method of creating and quantizing GGUF files in [this guide](../quantization/llama.cpp).
You can refer to that document for more information.
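As a quick sketch, the `llama-quantize` tool built alongside the other llama.cpp programs can quantize an fp16 GGUF; the file names below are placeholders:
```bash
# Quantize an fp16 GGUF to 4-bit (Q4_K_M)
./build/bin/llama-quantize models/qwen3-8b-f16.gguf models/qwen3-8b-q4_k_m.gguf Q4_K_M
```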
## Run Qwen with llama.cpp
:::{note}
Regarding switching between thinking and non-thinking modes,
while the soft switch is always available, the hard switch implemented in the chat template is not exposed in llama.cpp.
The quick workaround is to pass a custom chat template, equivalent to always setting `enable_thinking=False`, via `--chat-template-file`.
:::
### llama-cli
[llama-cli](https://github.com/ggml-org/llama.cpp/tree/master/examples/main) is a console program which can be used to chat with LLMs.
Simply run a command like the following in the directory where you placed the llama.cpp programs:
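An illustrative invocation is shown below (the sampling values follow the Qwen3 model card recommendations for thinking mode; `./build/bin/` applies to a local build, while package-manager installs put `llama-cli` directly on your PATH):
```bash
./build/bin/llama-cli \
    -hf Qwen/Qwen3-8B-GGUF:Q8_0 \
    --jinja --color \
    -ngl 99 -fa \
    -c 32768 \
    --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0
```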
- **Model**: llama-cli supports using model files from a local path, a remote URL, or the Hugging Face Hub.
  - `-hf Qwen/Qwen3-8B-GGUF:Q8_0` in the command above indicates that we are using the model file from the Hugging Face Hub.
  - To use a local path, pass `-m qwen3-8b-q8_0.gguf` instead.
  - To use a remote URL, pass `-mu https://hf.co/Qwen/Qwen3-8B-GGUF/resolve/main/qwen3-8b-Q8_0.gguf?download=true` instead.
- **Speed Optimization**:
  - CPU: llama-cli will use the CPU by default, and you can use `-t` to specify how many threads it should use, e.g., `-t 8` means using 8 threads.
  - GPU: If the programs are built with GPU support, you can use `-ngl`, which allows offloading some layers to the GPU for computation.
    If there are multiple GPUs, it will offload to all of them.
    You can use `-dev` to control the devices used and `-sm` to control which kind of parallelism is used.
    For example, `-ngl 99 -dev cuda0,cuda1 -sm row` means offloading all layers to GPU 0 and GPU 1 using the row split mode.
    Adding `-fa` may also speed up generation.
- **Sampling Parameters**: llama.cpp supports [a variety of sampling methods](https://github.com/ggml-org/llama.cpp/tree/master/examples/main#generation-flags) and has default configurations for many of them.
  It is recommended to adjust these parameters according to the actual use case; the recommended parameters from the Qwen3 model card can be used as a reference.
  If you encounter repetition or endless generation, it is recommended to additionally pass `--presence-penalty` with a value up to `2.0`.
- **Context Management**: llama.cpp adopts "rotating" context management by default.
  `-c` controls the maximum context length (default 4096; 0 means loaded from the model), and `-n` controls the maximum generation length each time (default -1 means infinite until an end token; -2 means until the context is full).
  When the context is full but the generation hasn't ended, the first `--keep` tokens (default 0; -1 means all) from the initial prompt are kept, and the first half of the rest is discarded.
  The model then continues to generate based on the new context tokens.
  You can set `--no-context-shift` to prevent this rotating behaviour, and generation will stop once the `-c` limit is reached.
  llama.cpp supports YaRN, which can be enabled by `-c 131072 --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768`.
- **Chat**: `--jinja` indicates using the chat template embedded in the GGUF, which is preferred, and `--color` colors the text so that user input and model output can be better differentiated.
  If there is a chat template, as in Qwen3 models, llama-cli will enter chat mode automatically.
  To stop generation or exit, press "Ctrl+C".
  You can use `-sys` to add a system prompt.
### llama-server
[llama-server](https://github.com/ggml-org/llama.cpp/tree/master/examples/server) is a simple HTTP server, including a set of LLM REST APIs and a simple web front end to interact with LLMs using llama.cpp.
The core command is similar to that of llama-cli.
In addition, it supports thinking content parsing and tool call parsing.
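For example, the following starts an OpenAI-compatible server on port 8080 and then queries it with `curl`. This is an illustrative sketch; run `llama-server --help` for the full option list:
```bash
./build/bin/llama-server \
    -hf Qwen/Qwen3-8B-GGUF:Q8_0 \
    --jinja -ngl 99 -fa \
    --host 0.0.0.0 --port 8080

# In another terminal, query the OpenAI-compatible chat endpoint
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Give me a short introduction to large language models."}]}'
```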
## MLX-LM

[mlx-lm](https://github.com/ml-explore/mlx-examples/tree/main/llms) helps you run LLMs locally on Apple Silicon.
It is available on macOS.
It already supports Qwen models, and this time we also provide checkpoints that you can use with it directly.
## Prerequisites
The easiest way to get started is to install the `mlx-lm` package:
- with `pip`:
```bash
pip install mlx-lm
```
- with `conda`:
```bash
conda install -c conda-forge mlx-lm
```
## Running with Qwen MLX Files
We provide model checkpoints for `mlx-lm` in our Hugging Face organization; to find what you need, search for repo names containing `-MLX`.
Here is a code snippet using `apply_chat_template` that shows how to load the tokenizer and model and how to generate content.
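The following is a minimal sketch; the repository name is illustrative (pick the MLX checkpoint you actually want from the Hub), and the `generate` arguments are kept to the basics:
```python
from mlx_lm import load, generate

# Load an MLX-converted Qwen model (repo name is illustrative)
model, tokenizer = load("Qwen/Qwen3-8B-MLX-4bit")

# Build a chat prompt from a message list using the chat template
messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Generate a response and print it as it streams
response = generate(model, tokenizer, prompt=text, max_tokens=512, verbose=True)
```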
.. code-block:: text

   # Multimodal format (supports images, audio, video)
   {"messages": [
   {"role": "user", "content": "<image>Describe this image"},
   {"role": "assistant", "content": "<description>"}
   ], "images": ["/path/to/image.jpg"]}
For complete dataset formatting guidelines, see: `Custom Dataset Documentation <https://swift.readthedocs.io/en/latest/Customization/Custom-dataset.html>`__
Pre-built datasets are available at: `Supported Datasets <https://swift.readthedocs.io/en/latest/Instruction/Supported-models-and-datasets.html#datasets>`__
The RL script in ms-swift has the following features:
- Supports single-GPU and multi-GPU training
- Supports full-parameter tuning, LoRA, Q-LoRA, and DoRA
- Supports multiple RL algorithms including GRPO, DAPO, PPO, DPO, KTO, ORPO, CPO, and SimPO
- Supports both large language models (LLM) and multimodal models (MLLM)
For detailed support information, please refer to: `Supported Features <https://swift.readthedocs.io/en/latest/Instruction/Pre-training-and-Fine-tuning.html#pre-training-and-fine-tuning>`__
Environment Setup
++++++++++++++++++
1. Follow the instructions of `ms-swift <https://github.com/modelscope/ms-swift>`__, and build the environment.
2. Install these packages (Optional)::

      pip install deepspeed
      pip install math_verify==0.5.2
      pip install flash-attn --no-build-isolation
      pip install vllm
Data Preparation
++++++++++++++++
ms-swift has built-in preprocessing logic for several datasets, which can be directly used for training via the ``--dataset`` parameter. For supported datasets, please refer to: `Supported Datasets <https://swift.readthedocs.io/en/latest/Instruction/Supported-models-and-datasets.html#datasets>`__
You can also use local custom datasets by providing the local dataset path to the ``--dataset`` parameter.
Example Dataset Formats:
.. code-block:: text

   # llm
   {"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "Tell me tomorrow's weather"}]}
   {"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}]}
   {"messages": [{"role": "user", "content": "What is your name?"}]}

   # mllm
   {"messages": [{"role": "user", "content": "<image>What is the difference between the two images?"}], "images": ["/xxx/x.jpg"]}
   {"messages": [{"role": "user", "content": "<image><image>What is the difference between the two images?"}], "images": ["/xxx/y.jpg", "/xxx/z.png"]}
Notes on Dataset Requirements
1. Reward Function Calculation: Depending on the reward function being used, additional columns may be required in the dataset. For example:

   - When using the built-in accuracy or cosine reward, the dataset must include a ``solution`` column to compute accuracy.
   - The other columns in the dataset will also be passed to the ``kwargs`` of the reward function.
2. Customizing the Reward Function: To tailor the reward function to your specific needs, you can refer to the following resource: `external reward plugin <https://github.com/modelscope/ms-swift/tree/main/examples/train/grpo/plugin>`__
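As a rough sketch, a GRPO training run with LoRA might be launched along the following lines; the flags follow the ms-swift GRPO examples, but treat the model, dataset, and values as placeholders and consult the linked documentation for the authoritative set of options::

    CUDA_VISIBLE_DEVICES=0 \
    swift rlhf \
        --rlhf_type grpo \
        --model Qwen/Qwen2.5-7B-Instruct \
        --train_type lora \
        --dataset <your_dataset_or_path> \
        --reward_funcs accuracy \
        --num_generations 4 \
        --max_completion_length 1024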
Welcome to use Qwen2.5-Instruct model, type text to start chat, type :h to show command help.
(欢迎使用 Qwen2.5-Instruct 模型,输入内容即可进行对话,:h 显示命令帮助。)
Note: This demo is governed by the original license of Qwen2.5.
We strongly advise users not to knowingly generate or allow others to knowingly generate harmful content, including hate speech, violence, pornography, deception, etc.
"content": "Tell me something about large language models."
},
{
"role": "assistant",
"content": "Large language models are a type of language model that is trained on a large corpus of text data. They are capable of generating human-like text and are used in a variety of natural language processing tasks..."
},
{
"role": "user",
"content": "How about Qwen2?"
},
{
"role": "assistant",
"content": "Qwen2 is a large language model developed by Alibaba Cloud..."
This document introduces the speed benchmark testing process for the Qwen2.5 series models (original and quantized models). For detailed reports, please refer to the [Qwen2.5 Speed Benchmark](https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html).
## 1. Model Collections
For models hosted on HuggingFace, refer to [Qwen2.5 Collections-HuggingFace](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e).
For models hosted on ModelScope, refer to [Qwen2.5 Collections-ModelScope](https://modelscope.cn/collections/Qwen25-dbc4d30adb768).
## 2. Environment Installation

> - For `flash-attention`, you can use the prebuilt wheels from [GitHub Releases](https://github.com/Dao-AILab/flash-attention/releases/tag/v2.5.8) or install from source, which requires a compatible CUDA compiler.
> - You don't actually need to install `flash-attention`. It has been integrated into `torch` as a backend of `sdpa`.
> - For `auto_gptq` to use efficient kernels, you need to install from source, because the prebuilt wheels require incompatible `torch` versions. Installing from source also requires a compatible CUDA compiler.
> - For `autoawq` to use efficient kernels, you need `autoawq-kernels`, which should be automatically installed. If not, run `pip install autoawq-kernels`.
For inference using vLLM:
```shell
conda create -n qwen_perf_vllm python=3.10
conda activate qwen_perf_vllm
pip install -r requirements-perf-vllm.txt
```
## 3. Execute Tests
Below are two methods for executing tests: using a script or the Speed Benchmark tool.
### Method 1: Testing with Speed Benchmark Tool
Use the Speed Benchmark tool developed by [EvalScope](https://github.com/modelscope/evalscope), which supports automatic model downloads from ModelScope and outputs test results. It also allows testing by specifying the model service URL. For details, please refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/speed_benchmark.html).
**Install Dependencies**
```shell
pip install 'evalscope[perf]' -U
```
#### HuggingFace Transformers Inference
Execute the command as follows:
```shell
CUDA_VISIBLE_DEVICES=0 evalscope perf \
--parallel 1 \
--model Qwen/Qwen2.5-0.5B-Instruct \
--attn-implementation flash_attention_2 \
--log-every-n-query 5 \
--connect-timeout 6000 \
--read-timeout 6000 \
--max-tokens 2048 \
--min-tokens 2048 \
--api local \
--dataset speed_benchmark
```
#### vLLM Inference
```shell
CUDA_VISIBLE_DEVICES=0 evalscope perf \
--parallel 1 \
--model Qwen/Qwen2.5-0.5B-Instruct \
--log-every-n-query 1 \
--connect-timeout 60000 \
--read-timeout 60000 \
--max-tokens 2048 \
--min-tokens 2048 \
--api local_vllm \
--dataset speed_benchmark
```
#### Parameter Explanation
- `--parallel`: sets the number of worker threads for concurrent requests; it should be fixed at 1.
- `--model`: specifies the model file path or model ID, supporting automatic downloads from ModelScope, e.g., Qwen/Qwen2.5-0.5B-Instruct.
- `--attn-implementation`: sets the attention implementation; optional values are flash_attention_2|eager|sdpa.
- `--log-every-n-query`: sets how often to log, once every n requests.
- `--connect-timeout`: sets the connection timeout in seconds.
- `--read-timeout`: sets the read timeout in seconds.
- `--max-tokens`: sets the maximum output length in tokens.
- `--min-tokens`: sets the minimum output length in tokens; setting both parameters to 2048 means the model will output a fixed length of 2048 tokens.
- `--api`: sets the inference interface; local inference options are local|local_vllm.
- `--dataset`: sets the test dataset; options are speed_benchmark|speed_benchmark_long.
#### Test Results
Test results can be found in the `outputs/{model_name}/{timestamp}/speed_benchmark.json` file, which contains all request results and test parameters.
### Method 2: Testing with Scripts

#### HuggingFace Transformers Inference

Parameters:

- `--model_id_or_path`: Model ID or local path; for optional values, refer to the `Model Resources` section
- `--context_length`: Input length in tokens; optional values are 1, 6144, 14336, 30720, 63488, 129024; refer to the `Qwen2.5 Model Efficiency Evaluation Report` for specifics
- `--generate_length`: Number of tokens to generate; default is 2048
- `--gpus`: Equivalent to the environment variable CUDA_VISIBLE_DEVICES, e.g., `0,1,2,3`, `4,5`
- `--use_modelscope`: If set, uses ModelScope to load the model; otherwise, uses HuggingFace
- `--outputs_dir`: Output directory; default is `outputs/transformers`
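For illustration, assuming the transformers benchmark script is named `speed_benchmark_transformers.py` (the script name is an assumption; use whatever script the repository actually provides), a run could look like:
```bash
python speed_benchmark_transformers.py \
    --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --context_length 1 \
    --gpus 0 \
    --outputs_dir outputs/transformers
```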
#### vLLM Inference

Parameters:

- `--model_id_or_path`: Model ID or local path; for optional values, refer to the `Model Resources` section
- `--context_length`: Input length in tokens; optional values are 1, 6144, 14336, 30720, 63488, 129024; refer to the `Qwen2.5 Model Efficiency Evaluation Report` for specifics
- `--generate_length`: Number of tokens to generate; default is 2048
- `--max_model_len`: Maximum model length in tokens; default is 32768
- `--gpus`: Equivalent to the environment variable CUDA_VISIBLE_DEVICES, e.g., `0,1,2,3`, `4,5`
- `--use_modelscope`: If set, uses ModelScope to load the model; otherwise, uses HuggingFace
- `--gpu_memory_utilization`: GPU memory utilization, range (0, 1]; default is 0.9
- `--outputs_dir`: Output directory; default is `outputs/vllm`
- `--enforce_eager`: Whether to enforce eager mode; default is False
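Similarly, assuming a vLLM benchmark script named `speed_benchmark_vllm.py` (again an assumed name), a run could look like:
```bash
python speed_benchmark_vllm.py \
    --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --context_length 1 \
    --max_model_len 32768 \
    --gpus 0 \
    --gpu_memory_utilization 0.9 \
    --outputs_dir outputs/vllm
```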
#### Test Results
Test results can be found in the `outputs` directory, which by default includes two folders for `transformers` and `vllm`, storing test results for HuggingFace transformers and vLLM respectively.
## Notes
1. Conduct multiple tests and take the average; three tests is a typical choice.
2. Ensure the GPU is idle before testing to avoid interference from other tasks.