# Speed Benchmark

This document describes how to benchmark the inference speed of the Qwen2.5 series models (both original and quantized). For the detailed report, please refer to the [Qwen2.5 Speed Benchmark](https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html).

## 1. Model Collections

For models hosted on HuggingFace, refer to [Qwen2.5 Collections-HuggingFace](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e).

For models hosted on ModelScope, refer to [Qwen2.5 Collections-ModelScope](https://modelscope.cn/collections/Qwen25-dbc4d30adb768).

## 2. Environment Setup


For inference using HuggingFace transformers:

```shell
conda create -n qwen_perf_transformers python=3.10
conda activate qwen_perf_transformers

pip install torch==2.3.1
pip install git+https://github.com/AutoGPTQ/AutoGPTQ.git@v0.7.1
pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.5.8
pip install -r requirements-perf-transformers.txt
```

> [!Important]
> - For `flash-attention`, you can use the prebuilt wheels from [GitHub Releases](https://github.com/Dao-AILab/flash-attention/releases/tag/v2.5.8) or install from source, which requires a compatible CUDA compiler.
>   - You don't actually need to install `flash-attention`; it has been integrated into `torch` as a backend of `sdpa`.
> - For `auto_gptq` to use efficient kernels, you need to install it from source, because the prebuilt wheels require incompatible `torch` versions. Installing from source also requires a compatible CUDA compiler.
> - For `autoawq` to use efficient kernels, you need `autoawq-kernels`, which should be installed automatically. If not, run `pip install autoawq-kernels`.
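
As a quick point of reference, here is a minimal Python sketch (not part of the benchmark scripts) showing how the attention implementation is selected when loading a Qwen2.5 model with transformers; `sdpa` works out of the box with a recent `torch`, while `flash_attention_2` additionally requires the `flash-attn` package.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"

# "sdpa" uses torch's scaled_dot_product_attention (which includes flash kernels);
# switch to "flash_attention_2" only if the flash-attn package is installed.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",  # or "flash_attention_2" / "eager"
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)
```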

For inference using vLLM:

```shell
conda create -n qwen_perf_vllm python=3.10
conda activate qwen_perf_vllm

pip install -r requirements-perf-vllm.txt
```

## 3. Execute Tests

Two methods are described below: using the EvalScope Speed Benchmark tool (Method 1) or the provided test scripts (Method 2).

### Method 1: Testing with the Speed Benchmark Tool

Use the Speed Benchmark tool provided by [EvalScope](https://github.com/modelscope/evalscope). It supports automatic model downloads from ModelScope, outputs test results, and can also test a deployed model service by specifying its URL. For details, please refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/speed_benchmark.html).

**Install Dependencies**
```shell
pip install 'evalscope[perf]' -U
```

#### HuggingFace Transformers Inference

Run the following command:
```shell
CUDA_VISIBLE_DEVICES=0 evalscope perf \
 --parallel 1 \
 --model Qwen/Qwen2.5-0.5B-Instruct \
 --attn-implementation flash_attention_2 \
 --log-every-n-query 5 \
 --connect-timeout 6000 \
 --read-timeout 6000 \
 --max-tokens 2048 \
 --min-tokens 2048 \
 --api local \
 --dataset speed_benchmark 
```

#### vLLM Inference

```shell
CUDA_VISIBLE_DEVICES=0 evalscope perf \
 --parallel 1 \
 --model Qwen/Qwen2.5-0.5B-Instruct \
 --log-every-n-query 1 \
 --connect-timeout 60000 \
 --read-timeout 60000 \
 --max-tokens 2048 \
 --min-tokens 2048 \
 --api local_vllm \
 --dataset speed_benchmark
```

#### Parameter Explanation
- `--parallel`: sets the number of concurrent request workers; keep it fixed at 1 for this benchmark.
- `--model`: specifies the model file path or model ID; model IDs are downloaded automatically from ModelScope, e.g., `Qwen/Qwen2.5-0.5B-Instruct`.
- `--attn-implementation`: sets the attention implementation; options are `flash_attention_2`, `eager`, and `sdpa`.
- `--log-every-n-query`: logs progress every n requests.
- `--connect-timeout`: sets the connection timeout in seconds.
- `--read-timeout`: sets the read timeout in seconds.
- `--max-tokens`: sets the maximum output length in tokens.
- `--min-tokens`: sets the minimum output length in tokens; setting both to 2048 forces a fixed output length of 2048 tokens.
- `--api`: sets the inference backend; for local inference, use `local` (transformers) or `local_vllm` (vLLM).
- `--dataset`: sets the test dataset; options are `speed_benchmark` and `speed_benchmark_long`.

#### Test Results

Test results can be found in the `outputs/{model_name}/{timestamp}/speed_benchmark.json` file, which contains all request results and test parameters.
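
To quickly inspect the most recent result file, a small sketch like the following can be used; the exact JSON layout depends on the EvalScope version, so print the content before relying on specific fields.

```python
import glob
import json
import os

# Pick the most recent run (path layout: outputs/{model_name}/{timestamp}/speed_benchmark.json).
result_files = sorted(glob.glob("outputs/*/*/speed_benchmark.json"), key=os.path.getmtime)

with open(result_files[-1]) as f:
    results = json.load(f)

# The top-level structure depends on the EvalScope version, so inspect it first.
print(json.dumps(results, indent=2)[:2000])
```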

### Method 2: Testing with Scripts

#### HuggingFace Transformers Inference

- Using HuggingFace Hub

```shell
python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --outputs_dir outputs/transformers
```

- Using ModelScope Hub

```shell
python speed_benchmark_transformers.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --gpus 0 --use_modelscope --outputs_dir outputs/transformers
```

Parameter Explanation:

- `--model_id_or_path`: model ID or local path; see the `Model Collections` section for available models
- `--context_length`: input length in tokens; valid values are 1, 6144, 14336, 30720, 63488, and 129024 (refer to the `Qwen2.5 Model Efficiency Evaluation Report` for details)
- `--generate_length`: number of tokens to generate; default is 2048
- `--gpus`: equivalent to the environment variable `CUDA_VISIBLE_DEVICES`, e.g., `0,1,2,3` or `4,5`
- `--use_modelscope`: if set, loads the model from ModelScope; otherwise, from HuggingFace
- `--outputs_dir`: output directory; default is `outputs/transformers`
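
As a rough illustration of what the script measures, the sketch below times fixed-length generation with transformers. It is a simplified stand-in, not the actual `speed_benchmark_transformers.py`; a one-token prompt approximates `--context_length 1`, and the fixed output length corresponds to `--generate_length 2048`.

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"
generate_length = 2048

model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A one-token prompt approximates the --context_length 1 setting.
input_ids = torch.tensor([[tokenizer.eos_token_id]], device=model.device)

start = time.time()
output = model.generate(
    input_ids,
    max_new_tokens=generate_length,
    min_new_tokens=generate_length,  # force a fixed output length
    do_sample=False,
)
elapsed = time.time() - start

new_tokens = output.shape[1] - input_ids.shape[1]
print(f"{new_tokens / elapsed:.2f} tokens/s, "
      f"{torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB peak memory")
```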

#### vLLM Inference

- Using HuggingFace Hub

```shell
python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm
```

- Using ModelScope Hub

```shell
python speed_benchmark_vllm.py --model_id_or_path Qwen/Qwen2.5-0.5B-Instruct --context_length 1 --max_model_len 32768 --gpus 0 --use_modelscope --gpu_memory_utilization 0.9 --outputs_dir outputs/vllm
```

Parameter Explanation:

- `--model_id_or_path`: model ID or local path; see the `Model Collections` section for available models
- `--context_length`: input length in tokens; valid values are 1, 6144, 14336, 30720, 63488, and 129024 (refer to the `Qwen2.5 Model Efficiency Evaluation Report` for details)
- `--generate_length`: number of tokens to generate; default is 2048
- `--max_model_len`: maximum model length in tokens; default is 32768
- `--gpus`: equivalent to the environment variable `CUDA_VISIBLE_DEVICES`, e.g., `0,1,2,3` or `4,5`
- `--use_modelscope`: if set, loads the model from ModelScope; otherwise, from HuggingFace
- `--gpu_memory_utilization`: GPU memory utilization, in the range (0, 1]; default is 0.9
- `--outputs_dir`: output directory; default is `outputs/vllm`
- `--enforce_eager`: whether to enforce eager mode; default is False
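
The corresponding vLLM measurement can be sketched with the offline `LLM` API and a fixed output length. Again, this is a simplified illustration, not the actual `speed_benchmark_vllm.py`.

```python
import time

from vllm import LLM, SamplingParams

model_id = "Qwen/Qwen2.5-0.5B-Instruct"
generate_length = 2048

llm = LLM(model=model_id, max_model_len=32768, gpu_memory_utilization=0.9)

# ignore_eos=True together with max_tokens forces a fixed output length.
params = SamplingParams(temperature=0.0, max_tokens=generate_length, ignore_eos=True)

# Rough wall-clock timing of a single short-prompt request.
start = time.time()
outputs = llm.generate(["Hi"], params)
elapsed = time.time() - start

n_tokens = len(outputs[0].outputs[0].token_ids)
print(f"{n_tokens / elapsed:.2f} tokens/s")
```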

#### Test Results

Test results are saved in the `outputs` directory, which by default contains two subdirectories, `transformers` and `vllm`, storing the results for HuggingFace transformers and vLLM respectively.

## Notes

1. Run each test multiple times and take the average; three runs is typical (see the snippet below).
2. Ensure the GPU is idle before testing to avoid interference from other tasks.
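
For example, averaging repeated runs takes only a few lines of Python (the throughput values below are placeholders, not measured results):

```python
from statistics import mean, stdev

# Placeholder tokens/s readings from three repeated runs of the same setting.
runs = [47.2, 46.8, 47.5]
print(f"mean: {mean(runs):.2f} tokens/s, stdev: {stdev(runs):.2f} tokens/s")
```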