Before starting, let's first discuss what llama.cpp is, what you should expect, and why we say "use" llama.cpp, with "use" in quotes.
llama.cpp is essentially a different ecosystem with a different design philosophy, targeting a lightweight footprint, minimal external dependencies, multi-platform support, and extensive, flexible hardware support:
- Plain C/C++ implementation without external dependencies
- Supports a wide variety of hardware:
  - AVX, AVX2, and AVX512 support for x86_64 CPUs
  - Apple Silicon via Metal and Accelerate (CPU and GPU)
- Various quantization schemes for faster inference and reduced memory footprint
- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity
It's like the Python frameworks `torch`+`transformers` or `torch`+`vllm`, but in C++.
However, the difference in language is crucial:
- Python is an interpreted language:
The code you write is executed line-by-line on-the-fly by an interpreter.
You can run example code snippets or scripts with the interpreter or in an interactive interpreter shell.
In addition, Python is learner-friendly; even without much prior knowledge, you can tweak the source code here and there.
- C++ is a compiled language:
The source code you write needs to be compiled beforehand: a compiler translates it into machine code and produces an executable program.
The overhead from the language side is minimal.
You do have source code for example programs showcasing how to use the library.
But it is not very easy to modify the source code if you are not versed in C++ or C.
To use llama.cpp means using the llama.cpp library in your own program, as the authors of [Ollama](https://ollama.com/), [LM Studio](https://lmstudio.ai/), [GPT4ALL](https://www.nomic.ai/gpt4all), [llamafile](https://llamafile.ai/), etc. do.
But that's not what this guide intends or is able to do.
Instead, here we introduce how to use the `llama-cli` example program, so that you know llama.cpp does support Qwen3 models and get a sense of how the llama.cpp ecosystem generally works.
:::
In this guide, we will show how to "use" [llama.cpp](https://github.com/ggml-org/llama.cpp) to run models on your local machine, in particular the `llama-cli` and `llama-server` example programs, which come with the library.
The main steps are:
1. Get the programs
2. Get the Qwen3 models in GGUF[^GGUF] format
3. Run the program with the model
:::{note}
llama.cpp supports Qwen3 and Qwen3MoE from version `b5092`.
:::
## Getting the Program
You can get the programs in various ways.
For optimal efficiency, we recommend compiling the programs locally, so you get the CPU optimizations for free.
However, if you don't have C++ compilers locally, you can also install via package managers or download pre-built binaries.
These may be less efficient, but for non-production example use, they are fine.
:::::{tab-set}
::::{tab-item} Compile Locally
Here, we show the basic command to compile `llama-cli` locally on **macOS** or **Linux**.
For Windows or GPU users, please refer to [the guide from llama.cpp](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md).
:::{rubric} Installing Build Tools
:heading-level: 5
:::
To build locally, a C++ compiler and a build system tool are required.
To see if they have been installed already, type `cc --version` or `cmake --version` in a terminal window.
- If installed, the version information of the tool will be printed to the terminal, and you are good to go!
- If errors are raised, you need to first install the related tools:
- On macOS, install with the command `xcode-select --install`
- On Ubuntu, install with the command `sudo apt install build-essential`.
For other Linux distributions, the command may vary; the essential packages needed for this guide are `gcc` and `cmake`.
:::{rubric} Compiling the Program
:heading-level: 5
:::
For the first step, clone the repo and enter the directory:
```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
```
Then, build llama.cpp using CMake:
```bash
cmake -B build
cmake --build build --config Release
```
The first command will check the local environment and determine which backends and features should be included.
The second command will actually build the programs.
To shorten the time, you can also enable parallel compiling based on the CPU cores you have, for example:
```bash
cmake --build build --config Release -j 8
```
This will build the programs with 8 parallel compiling jobs.
The built programs will be in `./build/bin/`.
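If you have a CUDA-capable GPU and the CUDA toolkit installed, a GPU-enabled build can be produced by enabling the corresponding backend at configure time. This is a minimal sketch; see the llama.cpp build guide linked above for Metal, Vulkan, SYCL, and other backends:
```bash
# Configure with the CUDA backend enabled, then build as before
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j 8
```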
::::
::::{tab-item} Package Managers
For **macOS** and **Linux** users, `llama-cli` and `llama-server` can be installed with package managers including Homebrew, Nix, and Flox.
Here, we show how to install `llama-cli` and `llama-server` with Homebrew.
For other package managers, please check the instructions [here](https://github.com/ggml-org/llama.cpp/blob/master/docs/install.md).
Installing with Homebrew is very simple:
1. Ensure that Homebrew is available on your operating system.
If you don't have Homebrew, you can install it following the instructions on [its website](https://brew.sh/).
2. Install the pre-built binaries, which include `llama-cli` and `llama-server`, with a single command:
```bash
brew install llama.cpp
```
Note that the installed binaries might not be built with the optimal compile options for your hardware, which can lead to poor performance.
They also don't support GPU on Linux systems.
::::
::::{tab-item} Binary Release
You can also download pre-built binaries from [GitHub Releases](https://github.com/ggml-org/llama.cpp/releases).
Please note that the pre-built binary files are architecture-, backend-, and OS-specific.
If you are not sure what those mean, you probably don't want to use them; running incompatible versions will most likely fail or lead to poor performance.
The file name is like `llama-<version>-bin-<os>-<feature>-<arch>.zip`.
There are three simple parts:
- `<version>`: the version of llama.cpp. The latest is preferred, but as llama.cpp is updated and released frequently, the latest may contain bugs. If the latest version does not work, try the previous release until it works.
- `<os>`: the operating system. `win` for Windows; `macos` for macOS; `linux` for Linux.
- `<arch>`: the system architecture. `x64` for `x86_64`, e.g., most Intel and AMD systems, including Intel Mac; `arm64` for `arm64`, e.g., Apple Silicon or Snapdragon-based systems.
The `<feature>` part is somewhat complicated for Windows:
- Running on CPU
  - x86_64 CPUs: We suggest trying the `avx2` one first.
    - `noavx`: No hardware acceleration at all.
    - `avx2`, `avx`, `avx512`: SIMD-based acceleration. Most modern desktop CPUs should support `avx2`, and some CPUs support `avx512`.
    - `openblas`: Relies on OpenBLAS to accelerate prompt processing but not generation.
  - arm64 CPUs: We suggest trying the `llvm` one first.
    - [`llvm` and `msvc`](https://github.com/ggml-org/llama.cpp/pull/7191) are different compilers.
- Running on GPU: We suggest trying the `cu<cuda_version>` one for NVIDIA GPUs, `kompute` for AMD GPUs, and `sycl` for Intel GPUs first. Ensure that you have the related drivers installed.
  - [`vulkan`](https://github.com/ggml-org/llama.cpp/pull/2059): supports certain NVIDIA and AMD GPUs
  - [`kompute`](https://github.com/ggml-org/llama.cpp/pull/4456): supports certain NVIDIA and AMD GPUs
  - [`sycl`](https://github.com/ggml-org/llama.cpp/discussions/5138): Intel GPUs; the oneAPI runtime is included
  - `cu<cuda_version>`: NVIDIA GPUs; the CUDA runtime is not included. You can download `cudart-llama-bin-win-cu<cuda_version>-x64.zip` and unzip it into the same directory if you don't have the corresponding CUDA toolkit installed.
You don't have much choice for macOS or Linux.
- Linux: only one prebuilt binary, `llama-<version>-bin-linux-x64.zip`, supporting CPU.
- macOS: `llama-<version>-bin-macos-x64.zip` for Intel Mac with no GPU support; `llama-<version>-bin-macos-arm64.zip` for Apple Silicon with GPU support.
After downloading the `.zip` file, unzip it into a directory and open a terminal at that directory.
::::
:::::
## Getting the GGUF
GGUF[^GGUF] is a file format for storing information needed to run a model, including but not limited to model weights, model hyperparameters, default generation configuration, and tokenizer.
You can use the official Qwen GGUFs from our HuggingFace Hub or prepare your own GGUF file.
### Using the Official Qwen3 GGUFs
We provide a series of GGUF models in our Hugging Face organization; to find what you need, search for repo names containing `-GGUF`.
Download the GGUF model that you want with `huggingface-cli` (you need to install it first with `pip install huggingface_hub`), as shown below.
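For example, the following downloads the Q8_0 file of Qwen3-8B to the current directory; the exact filename is an assumption here, so check the file list of the repo you choose for the quantization you want:
```bash
huggingface-cli download Qwen/Qwen3-8B-GGUF Qwen3-8B-Q8_0.gguf --local-dir .
```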
### Preparing Your Own GGUF

You can also convert a Hugging Face model to GGUF yourself with a conversion script, as sketched below.
The first argument to the script refers to the path to the HF model directory or the HF model name, and the second argument refers to the path of your output GGUF file.
Remember to create the output directory before you run the command.
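A minimal sketch of the conversion, assuming the `convert_hf_to_gguf.py` script shipped in the llama.cpp repository (the script name and exact options may vary across llama.cpp versions; here the output path is given via `--outfile`):
```bash
# Convert a local HF checkpoint directory to an fp16 GGUF file
python convert_hf_to_gguf.py ./Qwen3-8B --outtype f16 --outfile models/qwen3-8b-f16.gguf
```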
The fp16 model could be a bit heavy for running locally, and you can quantize the model as needed.
We introduce the method of creating and quantizing GGUF files in [this guide](../quantization/llama.cpp).
You can refer to that document for more information.
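As a quick sketch, the `llama-quantize` tool built alongside the other llama.cpp programs can quantize an fp16 GGUF; the file names below are placeholders:
```bash
# Quantize an fp16 GGUF to 4-bit (Q4_K_M)
./build/bin/llama-quantize models/qwen3-8b-f16.gguf models/qwen3-8b-q4_k_m.gguf Q4_K_M
```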
## Run Qwen with llama.cpp
:::{note}
Regarding switching between thinking and non-thinking modes,
while the soft switch is always available, the hard switch implemented in the chat template is not exposed in llama.cpp.
The quick workaround is to pass a custom chat template, equivalent to always setting `enable_thinking=False`, via `--chat-template-file`.
:::
### llama-cli
[llama-cli](https://github.com/ggml-org/llama.cpp/tree/master/examples/main) is a console program which can be used to chat with LLMs.
Simply run a command like the following in the directory where you placed the llama.cpp programs:
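An illustrative invocation is shown below (the sampling values follow the Qwen3 model card recommendations for thinking mode; `./build/bin/` applies to a local build, while package-manager installs put `llama-cli` directly on your PATH):
```bash
./build/bin/llama-cli \
    -hf Qwen/Qwen3-8B-GGUF:Q8_0 \
    --jinja --color \
    -ngl 99 -fa \
    -c 32768 \
    --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0
```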
- **Model**: llama-cli supports using model files from a local path, a remote URL, or the Hugging Face Hub.
  - `-hf Qwen/Qwen3-8B-GGUF:Q8_0` in the command above indicates that we are using the model file from the Hugging Face Hub.
  - To use a local path, pass `-m qwen3-8b-q8_0.gguf` instead.
  - To use a remote URL, pass `-mu https://hf.co/Qwen/Qwen3-8B-GGUF/resolve/main/qwen3-8b-Q8_0.gguf?download=true` instead.
- **Speed Optimization**:
  - CPU: llama-cli will use the CPU by default, and you can use `-t` to specify how many threads it should use, e.g., `-t 8` means using 8 threads.
  - GPU: If the programs are built with GPU support, you can use `-ngl`, which allows offloading some layers to the GPU for computation.
    If there are multiple GPUs, it will offload to all of them.
    You can use `-dev` to control the devices used and `-sm` to control which kind of parallelism is used.
    For example, `-ngl 99 -dev cuda0,cuda1 -sm row` means offloading all layers to GPU 0 and GPU 1 using the row split mode.
    Adding `-fa` may also speed up generation.
- **Sampling Parameters**: llama.cpp supports [a variety of sampling methods](https://github.com/ggml-org/llama.cpp/tree/master/examples/main#generation-flags) and has default configurations for many of them.
  It is recommended to adjust these parameters according to the actual use case; the recommended parameters from the Qwen3 model card can be used as a reference.
  If you encounter repetition or endless generation, it is recommended to additionally pass `--presence-penalty` with a value up to `2.0`.
- **Context Management**: llama.cpp adopts "rotating" context management by default.
  `-c` controls the maximum context length (default 4096; 0 means loaded from the model), and `-n` controls the maximum generation length each time (default -1 means infinite until an end token; -2 means until the context is full).
  When the context is full but the generation hasn't ended, the first `--keep` tokens (default 0; -1 means all) from the initial prompt are kept, and the first half of the rest is discarded.
  The model then continues to generate based on the new context tokens.
  You can set `--no-context-shift` to prevent this rotating behaviour, and generation will stop once the `-c` limit is reached.
  llama.cpp supports YaRN, which can be enabled by `-c 131072 --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768`.
- **Chat**: `--jinja` indicates using the chat template embedded in the GGUF, which is preferred, and `--color` colors the text so that user input and model output can be better differentiated.
  If there is a chat template, as in Qwen3 models, llama-cli will enter chat mode automatically.
  To stop generation or exit, press "Ctrl+C".
  You can use `-sys` to add a system prompt.
### llama-server
[llama-server](https://github.com/ggml-org/llama.cpp/tree/master/examples/server) is a simple HTTP server, including a set of LLM REST APIs and a simple web front end to interact with LLMs using llama.cpp.
The core command is similar to that of llama-cli.
In addition, it supports thinking content parsing and tool call parsing.
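For example, the following starts an OpenAI-compatible server on port 8080 and then queries it with `curl`. This is an illustrative sketch; run `llama-server --help` for the full option list:
```bash
./build/bin/llama-server \
    -hf Qwen/Qwen3-8B-GGUF:Q8_0 \
    --jinja -ngl 99 -fa \
    --host 0.0.0.0 --port 8080

# In another terminal, query the OpenAI-compatible chat endpoint
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Give me a short introduction to large language models."}]}'
```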
## MLX-LM

[mlx-lm](https://github.com/ml-explore/mlx-examples/tree/main/llms) helps you run LLMs locally on Apple Silicon.
It is available on macOS.
It already supports Qwen models, and this time we also provide checkpoints that you can use with it directly.
## Prerequisites
The easiest way to get started is to install the `mlx-lm` package:
- with `pip`:
```bash
pip install mlx-lm
```
- with `conda`:
```bash
conda install -c conda-forge mlx-lm
```
## Running with Qwen MLX Files
We provide model checkpoints for `mlx-lm` in our Hugging Face organization; to find what you need, search for repo names containing `-MLX`.
Here is a code snippet using `apply_chat_template` that shows how to load the tokenizer and model and how to generate content.
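The following is a minimal sketch; the repository name is illustrative (pick the MLX checkpoint you actually want from the Hub), and the `generate` arguments are kept to the basics:
```python
from mlx_lm import load, generate

# Load an MLX-converted Qwen model (repo name is illustrative)
model, tokenizer = load("Qwen/Qwen3-8B-MLX-4bit")

# Build a chat prompt from a message list using the chat template
messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Generate a response and print it as it streams
response = generate(model, tokenizer, prompt=text, max_tokens=512, verbose=True)
```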
.. code-block:: text

   # Multimodal format (supports images, audio, video)
   {"messages": [
   {"role": "user", "content": "<image>Describe this image"},
   {"role": "assistant", "content": "<description>"}
   ], "images": ["/path/to/image.jpg"]}
For complete dataset formatting guidelines, see: `Custom Dataset Documentation <https://swift.readthedocs.io/en/latest/Customization/Custom-dataset.html>`__
Pre-built datasets are available at: `Supported Datasets <https://swift.readthedocs.io/en/latest/Instruction/Supported-models-and-datasets.html#datasets>`__
The RL script in ms-swift has the following features:
- Supports single-GPU and multi-GPU training
- Supports full-parameter tuning, LoRA, Q-LoRA, and DoRA
- Supports multiple RL algorithms including GRPO, DAPO, PPO, DPO, KTO, ORPO, CPO, and SimPO
- Supports both large language models (LLM) and multimodal models (MLLM)
For detailed support information, please refer to: `Supported Features <https://swift.readthedocs.io/en/latest/Instruction/Pre-training-and-Fine-tuning.html#pre-training-and-fine-tuning>`__
Environment Setup
++++++++++++++++++
1. Follow the instructions of `ms-swift <https://github.com/modelscope/ms-swift>`__, and build the environment.
2. Install these packages (Optional)::

      pip install deepspeed
      pip install math_verify==0.5.2
      pip install flash-attn --no-build-isolation
      pip install vllm
Data Preparation
++++++++++++++++
ms-swift has built-in preprocessing logic for several datasets, which can be directly used for training via the ``--dataset`` parameter. For supported datasets, please refer to: `Supported Datasets <https://swift.readthedocs.io/en/latest/Instruction/Supported-models-and-datasets.html#datasets>`__
You can also use local custom datasets by providing the local dataset path to the ``--dataset`` parameter.
Example Dataset Formats:
.. code-block:: text

   # llm
   {"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "Tell me tomorrow's weather"}]}
   {"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}]}
   {"messages": [{"role": "user", "content": "What is your name?"}]}

   # mllm
   {"messages": [{"role": "user", "content": "<image>What is the difference between the two images?"}], "images": ["/xxx/x.jpg"]}
   {"messages": [{"role": "user", "content": "<image><image>What is the difference between the two images?"}], "images": ["/xxx/y.jpg", "/xxx/z.png"]}
Notes on Dataset Requirements
1. Reward Function Calculation: Depending on the reward function being used, additional columns may be required in the dataset. For example:

   - When using the built-in accuracy or cosine reward, the dataset must include a ``solution`` column to compute accuracy.
   - The other columns in the dataset will also be passed to the ``kwargs`` of the reward function.
2. Customizing the Reward Function: To tailor the reward function to your specific needs, you can refer to the following resource: `external reward plugin <https://github.com/modelscope/ms-swift/tree/main/examples/train/grpo/plugin>`__
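As a rough sketch, a GRPO training run with LoRA might be launched along the following lines; the flags follow the ms-swift GRPO examples, but treat the model, dataset, and values as placeholders and consult the linked documentation for the authoritative set of options::

    CUDA_VISIBLE_DEVICES=0 \
    swift rlhf \
        --rlhf_type grpo \
        --model Qwen/Qwen2.5-7B-Instruct \
        --train_type lora \
        --dataset <your_dataset_or_path> \
        --reward_funcs accuracy \
        --num_generations 4 \
        --max_completion_length 1024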
Welcome to use Qwen2.5-Instruct model, type text to start chat, type :h to show command help.
(欢迎使用 Qwen2.5-Instruct 模型,输入内容即可进行对话,:h 显示命令帮助。)
Note: This demo is governed by the original license of Qwen2.5.
We strongly advise users not to knowingly generate or allow others to knowingly generate harmful content, including hate speech, violence, pornography, deception, etc.
"content": "Tell me something about large language models."
},
{
"role": "assistant",
"content": "Large language models are a type of language model that is trained on a large corpus of text data. They are capable of generating human-like text and are used in a variety of natural language processing tasks..."
},
{
"role": "user",
"content": "How about Qwen2?"
},
{
"role": "assistant",
"content": "Qwen2 is a large language model developed by Alibaba Cloud..."
This document introduces the speed benchmark testing process for the Qwen2.5 series models (original and quantized models). For detailed reports, please refer to the [Qwen2.5 Speed Benchmark](https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html).
## 1. Model Collections
For models hosted on HuggingFace, refer to [Qwen2.5 Collections-HuggingFace](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e).
For models hosted on ModelScope, refer to [Qwen2.5 Collections-ModelScope](https://modelscope.cn/collections/Qwen25-dbc4d30adb768).
## 2. Environment Installation

> - For `flash-attention`, you can use the prebuilt wheels from [GitHub Releases](https://github.com/Dao-AILab/flash-attention/releases/tag/v2.5.8) or install from source, which requires a compatible CUDA compiler.
> - You don't actually need to install `flash-attention`. It has been integrated into `torch` as a backend of `sdpa`.
> - For `auto_gptq` to use efficient kernels, you need to install from source, because the prebuilt wheels require incompatible `torch` versions. Installing from source also requires a compatible CUDA compiler.
> - For `autoawq` to use efficient kernels, you need `autoawq-kernels`, which should be automatically installed. If not, run `pip install autoawq-kernels`.
For inference using vLLM:
```shell
conda create -n qwen_perf_vllm python=3.10
conda activate qwen_perf_vllm
pip install -r requirements-perf-vllm.txt
```
## 3. Execute Tests
Below are two methods for executing tests: using a script or the Speed Benchmark tool.
### Method 1: Testing with Speed Benchmark Tool
Use the Speed Benchmark tool developed by [EvalScope](https://github.com/modelscope/evalscope), which supports automatic model downloads from ModelScope and outputs test results. It also allows testing by specifying the model service URL. For details, please refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/speed_benchmark.html).
**Install Dependencies**
```shell
pip install 'evalscope[perf]' -U
```
#### HuggingFace Transformers Inference
Execute the command as follows:
```shell
CUDA_VISIBLE_DEVICES=0 evalscope perf \
--parallel 1 \
--model Qwen/Qwen2.5-0.5B-Instruct \
--attn-implementation flash_attention_2 \
--log-every-n-query 5 \
--connect-timeout 6000 \
--read-timeout 6000 \
--max-tokens 2048 \
--min-tokens 2048 \
--api local \
--dataset speed_benchmark
```
#### vLLM Inference
```shell
CUDA_VISIBLE_DEVICES=0 evalscope perf \
--parallel 1 \
--model Qwen/Qwen2.5-0.5B-Instruct \
--log-every-n-query 1 \
--connect-timeout 60000 \
--read-timeout 60000 \
--max-tokens 2048 \
--min-tokens 2048 \
--api local_vllm \
--dataset speed_benchmark
```
#### Parameter Explanation
- `--parallel`: sets the number of worker threads for concurrent requests; it should be fixed at 1.
- `--model`: specifies the model file path or model ID, supporting automatic downloads from ModelScope, e.g., Qwen/Qwen2.5-0.5B-Instruct.
- `--attn-implementation`: sets the attention implementation; optional values are flash_attention_2|eager|sdpa.
- `--log-every-n-query`: sets how often to log, once every n requests.
- `--connect-timeout`: sets the connection timeout in seconds.
- `--read-timeout`: sets the read timeout in seconds.
- `--max-tokens`: sets the maximum output length in tokens.
- `--min-tokens`: sets the minimum output length in tokens; setting both parameters to 2048 means the model will output a fixed length of 2048 tokens.
- `--api`: sets the inference interface; local inference options are local|local_vllm.
- `--dataset`: sets the test dataset; options are speed_benchmark|speed_benchmark_long.
#### Test Results
Test results can be found in the `outputs/{model_name}/{timestamp}/speed_benchmark.json` file, which contains all request results and test parameters.
### Method 2: Testing with Scripts

#### HuggingFace Transformers Inference

Parameters:

- `--model_id_or_path`: Model ID or local path; for optional values, refer to the `Model Resources` section
- `--context_length`: Input length in tokens; optional values are 1, 6144, 14336, 30720, 63488, 129024; refer to the `Qwen2.5 Model Efficiency Evaluation Report` for specifics
- `--generate_length`: Number of tokens to generate; default is 2048
- `--gpus`: Equivalent to the environment variable CUDA_VISIBLE_DEVICES, e.g., `0,1,2,3`, `4,5`
- `--use_modelscope`: If set, uses ModelScope to load the model; otherwise, uses HuggingFace
- `--outputs_dir`: Output directory; default is `outputs/transformers`
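For illustration, assuming the transformers benchmark script is named `speed_benchmark_transformers.py` (the script name is an assumption; use whatever script the repository actually provides), a run could look like:
```bash
python speed_benchmark_transformers.py \
    --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --context_length 1 \
    --gpus 0 \
    --outputs_dir outputs/transformers
```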
#### vLLM Inference

Parameters:

- `--model_id_or_path`: Model ID or local path; for optional values, refer to the `Model Resources` section
- `--context_length`: Input length in tokens; optional values are 1, 6144, 14336, 30720, 63488, 129024; refer to the `Qwen2.5 Model Efficiency Evaluation Report` for specifics
- `--generate_length`: Number of tokens to generate; default is 2048
- `--max_model_len`: Maximum model length in tokens; default is 32768
- `--gpus`: Equivalent to the environment variable CUDA_VISIBLE_DEVICES, e.g., `0,1,2,3`, `4,5`
- `--use_modelscope`: If set, uses ModelScope to load the model; otherwise, uses HuggingFace
- `--gpu_memory_utilization`: GPU memory utilization, range (0, 1]; default is 0.9
- `--outputs_dir`: Output directory; default is `outputs/vllm`
- `--enforce_eager`: Whether to enforce eager mode; default is False
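Similarly, assuming a vLLM benchmark script named `speed_benchmark_vllm.py` (again an assumed name), a run could look like:
```bash
python speed_benchmark_vllm.py \
    --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --context_length 1 \
    --max_model_len 32768 \
    --gpus 0 \
    --gpu_memory_utilization 0.9 \
    --outputs_dir outputs/vllm
```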
#### Test Results
Test results can be found in the `outputs` directory, which by default includes two folders for `transformers` and `vllm`, storing test results for HuggingFace transformers and vLLM respectively.
## Notes
1. Conduct multiple tests and take the average; three tests is a typical choice.
2. Ensure the GPU is idle before testing to avoid interference from other tasks.