# Benchmark
## Table of Contents
- [Parameter Settings](#parameter-settings)
- [Quantization](#quantization)
- [Model Type & Max Length](#model-type--max-length)
- [Batch Size](#batch-size)
- [Use Flash Attn & Gradient Checkpointing](#use-flash-attn--gradient-checkpointing)
- [LoRA Rank & LoRA Target Modules](#lora-rank--lora-target-modules)
- [Gradient Accumulation Steps](#gradient-accumulation-steps)
- [Tuners](#tuners)
- [unsloth](#unsloth)
- [Export](#export)
- [AWQ](#awq)
- [AQLM](#aqlm)
- [Sequence Parallel](#sequence-parallel)
## Parameter Settings
Experimental environment:
- A100
- CUDA 11.8
- python 3.10
- torch 2.1.1
- flash_attn 2.3.4
- xformers 0.0.23
- auto_gptq 0.5.1
- bitsandbytes 0.41.3.post2
The following command-line arguments are shared by all experiments:
```bash
--dataset_test_ratio 0 \
--dataset cls-fudan-news-zh \
--save_strategy no \
--check_dataset_strategy warning \
--preprocess_num_proc 4 \
```
If not specified otherwise, the following default values are used:
```bash
--max_length 2048 \
--batch_size 1 \
--gradient_checkpointing true \
--use_flash_attn true \
--lora_rank 8 \
--lora_target_modules DEFAULT \
--quantization_bit 0 \
--gradient_accumulation_steps 16 \
```
Token count statistics for the test dataset (computed with the qwen tokenizer): 3234.4±2547.5, min=91, max=19548.
The scripts used for these experiments can be found under `scripts/benchmark/test_memory_time/`.
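For reference, here is one complete invocation assembled from the shared settings and defaults above; the trailing `...` in the per-section test scripts below can be read as standing in for these same arguments:
```bash
# Sketch of a full benchmark command (qwen-7b-chat, LoRA), combining
# the shared settings with the default hyperparameters listed above.
swift sft \
    --model_type qwen-7b-chat \
    --sft_type lora \
    --dataset cls-fudan-news-zh \
    --dataset_test_ratio 0 \
    --save_strategy no \
    --check_dataset_strategy warning \
    --preprocess_num_proc 4 \
    --max_length 2048 \
    --batch_size 1 \
    --gradient_checkpointing true \
    --use_flash_attn true \
    --lora_rank 8 \
    --lora_target_modules DEFAULT \
    --quantization_bit 0 \
    --gradient_accumulation_steps 16
```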
## Quantization
The test script is:
```bash
swift sft \
--model_type {MODEL_TYPE} \
--quantization_bit {QUANTIZATION_BIT} \
--sft_type lora \
...
```
| Model Type [LoRA] | Quantization | Training Speed (samples/s) | GPU Memory (GiB) |
| ----------------- | ------------ | -------------------------- | ---------------- |
| qwen-7b-chat      | bf16         | 4.31                       | 27.74            |
|                   | int4 (gptq)  | 2.05                       | 19.21            |
|                   | int8 (gptq)  | 1.97                       | 22.20            |
|                   | int4 (bnb)   | 2.41                       | 23.85            |
| qwen-14b-chat     | bf16         | 2.60                       | 40.14            |
|                   | int4 (gptq)  | 1.15                       | 23.30            |
|                   | int8 (gptq)  | 1.08                       | 29.13            |
|                   | int4 (bnb)   | 1.36                       | 30.05            |
| qwen-72b-chat     | bf16         | 0.59 (2*A100)              | 73.71+78.54      |
|                   | int4 (gptq)  | 0.23                       | 54.86            |
|                   | int8 (gptq)  | 0.21                       | 78.44            |
|                   | int4 (bnb)   | 0.28                       | 74.87            |
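The bf16 rows are the unquantized baseline. A minimal sketch of a bnb row is shown below, assuming `--quantization_bit 4` selects on-the-fly bnb quantization of the bf16 weights; the gptq rows presumably load the corresponding pre-quantized checkpoints instead (e.g. an int4 variant of the model type), which is an assumption based on the naming:
```bash
# Sketch: the "int4 (bnb)" row for qwen-7b-chat.
swift sft \
    --model_type qwen-7b-chat \
    --quantization_bit 4 \
    --sft_type lora \
    ...
```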
## Model Type & Max Length
### LoRA
The test script is:
```bash
swift sft \
--model_type {MODEL_TYPE} \
--max_length {MAX_LENGTH} \
--sft_type lora \
...
```
| Model Type [LoRA] | Max Length | Training Speed (samples/s) | GPU Memory (GiB) |
| ----------------- | ---------- | -------------------------- | ---------------- |
| qwen-1_8b-chat | 512 | 9.88 | 6.99 |
| | 1024 | 9.90 | 10.71 |
| | 2048 | 8.77 | 16.35 |
| | 4096 | 5.92 | 23.80 |
| | 8192 | 4.19 | 37.03 |
| qwen-7b-chat | 512 | 7.43 | 18.01 |
| | 1024 | 6.51 | 21.73 |
| | 2048 | 4.31 | 27.74 |
| | 4096 | 2.05 | 35.31 |
| | 8192 | 1.34 | 48.41 |
| qwen-14b-chat | 512 | 5.63 | 30.14 |
| | 1024 | 4.36 | 34.43 |
| | 2048 | 2.60 | 40.14 |
| | 4096 | 1.17 | 47.95 |
| | 8192 | 0.79 | 60.74 |
| qwen-72b-chat (2*A100) | 512 | 1.41 | 67.68+73.07 |
| | 1024 | 1.02 | 70.25+77.11 |
| | 2048 | 0.59 | 73.71+78.54 |
| | 4096 | - | OOM |
| | 8192 | - | OOM |
| chatglm3-6b | 512 | 6.72 | 13.94 |
| | 1024 | 6.16 | 12.99 |
| | 2048 | 4.20 | 17.20 |
| | 4096 | 1.92 | 29.80 |
| | 8192 | 1.24 | 66.82 |
| yi-6b-chat | 512 | 5.27 | 13.72 |
| | 1024 | 5.07 | 15.44 |
| | 2048 | 3.84 | 16.95 |
| | 4096 | 1.99 | 28.25 |
| | 8192 | 1.35 | 43.81 |
| yi-34b-chat | 512 | 2.32 | 66.72 |
| | 1024 | 1.76 | 69.10 |
| | 2048 | 1.05 | 71.34 |
| | 4096 | 0.47 | 78.72 |
| | 8192 | 0.31 (2*A100) | 47.01+65.03 |
| openbuddy-zephyr-7b-chat | 512 | 5.17 | 14.99 |
| | 1024 | 3.92 | 16.57 |
| | 2048 | 3.08 | 19.89 |
| | 4096 | 1.85 | 23.29 |
| | 8192 | 0.92 | 52.14 |
| baichuan2-7b-chat | 512 | 6.09 | 18.18 |
| | 1024 | 5.36 | 17.45 |
| | 2048 | 3.43 | 19.18 |
| | 4096 | 1.69 | 34.22 |
| | 8192 | 1.16 | 45.47 |
| baichuan2-13b-chat | 512 | 5.32 | 31.01 |
| | 1024 | 3.91 | 31.58 |
| | 2048 | 1.77 | 32.40 |
| | 4096 | 0.65 | 49.63 |
| | 8192 | 0.36 | 76.17 |
### Full
The test script is:
```bash
swift sft \
--model_type {MODEL_TYPE} \
--max_length {MAX_LENGTH} \
--sft_type full \
...
```
| Model Type [FULL] | Max Length | Training Speed (samples/s) | GPU Memory (GiB) |
| ----------------- | ---------- | -------------------------- | ---------------- |
| qwen-1_8b-chat | 512 | 10.77 | 18.16 |
| | 1024 | 10.39 | 18.62 |
| | 2048 | 8.73 | 35.11 |
| | 4096 | 5.45 | 31.62 |
| | 8192 | 3.81 | 38.93 |
| qwen-7b-chat | 512 | 5.96 | 73.37 |
| | 1024 | 5.00 | 73.64 |
| | 2048 | 3.30 | 74.26 |
| | 4096 | 1.64 | 78.76 |
| | 8192 | 1.11 (2*A100) | 61.34+73.00 |
| qwen-14b-chat (2*A100) | 512 | 3.66 | 60.42+72.31 |
| | 1024 | 2.98 | 60.61+74.37 |
| | 2048 | 1.93 | 60.70+78.22 |
| | 4096 | 0.92 | 75.59+78.64 |
| | 8192 | 0.62 | 76.59+77.68 |
## Batch Size
The test script is:
```bash
swift sft \
--batch_size {BATCH_SIZE} \
--model_type qwen-7b-chat \
--sft_type lora \
...
```
| Model Type [LoRA] | Batch Size | Training Speed (samples/s) | GPU Memory (GiB) |
| ----------------- | ---------- | -------------------------- | ---------------- |
| qwen-7b-chat | 1 | 4.31 | 27.74 |
| | 2 | 3.60 | 43.11 |
| | 4 | 3.02 | 63.81 |
| | 8 | 2.77 | 76.14 |
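Note that throughput in samples/s actually drops as the batch size grows (4.31 at batch size 1 vs 2.77 at 8), likely because samples of very different lengths (see the token statistics above) are padded to the longest sample in each batch, while memory rises steeply. The effective batch size is `batch_size * gradient_accumulation_steps`, and as the Gradient Accumulation Steps section below shows, accumulation enlarges it at essentially no memory or speed cost:
```bash
# Sketch: enlarge the effective batch via accumulation rather than batch size.
# Effective batch = 1 * 64 = 64 samples per optimizer step; memory stays ~27.7 GiB.
swift sft \
    --batch_size 1 \
    --gradient_accumulation_steps 64 \
    --model_type qwen-7b-chat \
    --sft_type lora \
    ...
```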
## Use Flash Attn & Gradient Checkpointing
The test script is:
```bash
swift sft \
--use_flash_attn {USE_FLASH_ATTN} \
--gradient_checkpointing {GRADIENT_CHECKPOINTING} \
--model_type qwen-7b-chat \
--sft_type lora \
...
```
| Model Type [LoRA] | Use Flash Attn | Gradient Checkpointing | Training Speed (samples/s) | GPU Memory (GiB) |
| ----------------- | -------------- | ---------------------- | -------------------------- | ---------------- |
| qwen-7b-chat | ✔ | ✔ | 4.31 | 27.74 |
| | ✔ | ✘ | 6.19 | 37.70 |
| | ✘ | ✔ | 3.13 | 27.71 |
| | ✘ | ✘ | 4.45 | 57.67 |
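Flash attention is faster in both settings and, with gradient checkpointing disabled, also cuts memory substantially (37.70 vs 57.67 GiB); gradient checkpointing itself trades roughly 30% of the speed for memory. The environment above pins flash_attn 2.3.4; a typical install, assuming a wheel or build matching your torch/CUDA versions is available, looks like:
```bash
# flash-attn builds CUDA extensions against the already-installed torch,
# so --no-build-isolation is commonly required.
pip install flash-attn==2.3.4 --no-build-isolation
```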
## LoRA Rank & LoRA Target Modules
The test script is:
```bash
swift sft \
--lora_rank {LORA_RANK} \
--lora_target_modules {LORA_TARGET_MODULES} \
--model_type qwen-7b-chat \
--sft_type lora \
...
```
| Model Type [LoRA] | LoRA Rank | LoRA Target Modules | Training Speed (samples/s) | GPU Memory (GiB) | Trainable Params (M) |
| ----------------- | --------- | ------------------- | -------------------------- | ---------------- | -------------------- |
| qwen-7b-chat | 2 | DEFAULT (c_attn) | 4.27 | 27.72 | 1.05 |
| | 8 | DEFAULT | 4.31 | 27.74 | 4.19 |
| | 64 | DEFAULT | 4.19 | 27.85 | 33.55 |
| | 8 | ALL (all linear) | 3.22 | 27.87 | 17.89 |
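The Trainable Params column follows from the LoRA parameter count `r * (d_in + d_out)` per adapted weight matrix. For qwen-7b-chat (32 layers, hidden size 4096, with `c_attn` projecting to 3*4096 = 12288), rank 8 on DEFAULT gives 32 * 8 * (4096 + 12288) ≈ 4.19M, matching the table; rank 2 is a quarter of that (≈1.05M) and rank 64 eight times as much (≈33.55M). Targeting ALL adds adapters to every linear layer, which is why it costs training speed but barely any extra memory.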
## Gradient Accumulation Steps
The test script is:
```bash
swift sft \
--gradient_accumulation_steps {GRADIENT_ACCUMULATION_STEPS} \
--model_type qwen-7b-chat \
--sft_type lora \
...
```
| Model Type [LoRA] | Gradient Accumulation Steps | Training Speed (samples/s) | GPU Memory (GiB) |
| ----------------- | --------------------------- | -------------------------- | ---------------- |
| qwen-7b-chat | 1 | 4.26 | 27.73 |
| | 2 | 4.32 | 27.74 |
| | 4 | 4.31 | 27.74 |
| | 8 | 4.32 | 27.74 |
| | 16 | 4.33 | 27.74 |
| | 32 | 4.30 | 27.74 |
| | 64 | 4.32 | 27.74 |
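Memory and throughput are essentially flat across accumulation steps: gradients from each micro-batch are summed into the same buffers, so accumulation only changes how often the optimizer step runs, not the per-step activation footprint.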
## Tuners
| exp_name | model_type | dataset | ms-bench mix ratio | tuner | tuner_params | trainable params(M) | flash_attn | gradient_checkpointing | hypers | memory | train speed(samples/s) | infer speed(tokens/s) | train_loss | eval_loss | gsm8k weighted acc | arc weighted acc | ceval weighted acc |
| -------- | ---------- | ------- | -------------------| ----- | ------------ | ------------------- | -----------| ---------------------- | ------ | ------ | ---------------------- | --------------------- | ---------- | --------- | ------------------ | ---------------- | ------------------ |
|adalora|qwen-7b-chat|ms-agent|2.0|adalora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False|26.8389(0.3464%)|True|True|lr=5e-05/epoch=2|32.55GiB|0.92(87543 samples/95338.71 seconds)|17.33(2345 tokens/135.29 seconds)|0.57|1.07|0.391|0.665|0.569|
|adapter|qwen-7b-chat|ms-agent|2.0|adapter||33.6896(0.4344%)|True|True|lr=5e-05/epoch=2|32.19GiB|1.48(87543 samples/59067.71 seconds)|26.63(4019 tokens/150.90 seconds)|0.55|1.03|0.438|0.662|0.565|
|dora|qwen-7b-chat|ms-agent|2.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=True|19.2512(0.2487%)|True|True|lr=5e-05/epoch=2|32.46GiB|0.51(87543 samples/171110.54 seconds)|4.29(2413 tokens/562.32 seconds)|0.53|1.01|0.466|0.683|**0.577**|
|full+galore128|qwen-7b-chat|ms-agent|2.0|full|galore_rank=128/galore_per_parameter=false/galore_with_embedding=false|7721.3245(100.0000%)|True|True|lr=5e-05/epoch=2|47.02GiB|1.10(87543 samples/79481.96 seconds)|28.96(2400 tokens/82.88 seconds)|0.55|1.00|0.358|**0.688**|**0.577**|
|full+galore32|qwen-7b-chat|ms-agent|2.0|full|galore_rank=32/galore_per_parameter=false/galore_with_embedding=false|7721.3245(100.0000%)|True|True|lr=5e-05/epoch=2|47.05GiB|1.11(87543 samples/78989.74 seconds)|29.17(2431 tokens/83.35 seconds)|0.56|1.01|0.386|0.667|0.539|
|full+galore64|qwen-7b-chat|ms-agent|2.0|full|galore_rank=64/galore_per_parameter=false/galore_with_embedding=false|7721.3245(100.0000%)|True|True|lr=5e-05/epoch=2|46.91GiB|1.11(87543 samples/79200.36 seconds)|28.94(2448 tokens/84.60 seconds)|0.56|1.01|0.397|0.674|0.544|
|full+galore_emb|qwen-7b-chat|ms-agent|2.0|full|galore_rank=128/galore_per_parameter=false/galore_with_embedding=true|7721.3245(100.0000%)|True|True|lr=5e-05/epoch=2|44.53GiB|1.10(87543 samples/79775.02 seconds)|29.45(2433 tokens/82.62 seconds)|0.55|1.00|0.398|0.670|0.568|
|full+galore_perparam|qwen-7b-chat|ms-agent|2.0|full|galore_rank=128/galore_per_parameter=true/galore_with_embedding=false|7721.3245(100.0000%)|True|True|lr=5e-05/epoch=2|47.02GiB|1.25(87543 samples/69821.89 seconds)|29.02(2478 tokens/85.39 seconds)|0.54|1.00|0.372|0.669|0.524|
|full+no_mix|qwen-7b-chat|ms-agent|0.0|full||7721.3245(100.0000%)|True|True|lr=5e-05/epoch=2|72.56GiB|1.27(29698 samples/23356.97 seconds)|30.31(11738 tokens/387.29 seconds)|0.57|**0.44**|0.174|0.652|0.553|
|full|qwen-7b-chat|ms-agent|2.0|full||7721.3245(100.0000%)|True|True|lr=5e-05/epoch=2|73.53GiB|1.43(87543 samples/61022.97 seconds)|29.51(3382 tokens/114.62 seconds)|0.54|0.95|0.343|0.536|0.495|
|llamapro|qwen-7b-chat|ms-agent|2.0|llamapro|num_blocks=4|809.5826(9.4900%)|True|True|lr=5e-05/epoch=2|38.11GiB|1.53(87543 samples/57294.42 seconds)|25.80(2374 tokens/92.02 seconds)|0.53|1.00|0.434|0.645|0.357|
|lora+|qwen-7b-chat|ms-agent|2.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=16.0/use_rslora=False/use_dora=False|17.8913(0.2312%)|True|True|lr=5e-05/epoch=2|32.35GiB|0.95(87543 samples/91923.80 seconds)|18.81(3329 tokens/176.94 seconds)|0.53|0.98|0.432|0.647|0.344|
|lora+neftune|qwen-7b-chat|ms-agent|2.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False/neftune_noise_alpha=15.0|17.8913(0.2312%)|True|True|lr=5e-05/epoch=2|32.35GiB|0.96(87543 samples/91525.50 seconds)|19.84(161792 tokens/8156.02 seconds)|0.53|1.02|0.456|0.671|0.401|
|lora+no_mix|qwen-7b-chat|ms-agent|0.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False|17.8913(0.2312%)|True|True|lr=5e-05/epoch=2|30.86GiB|0.91(29698 samples/32570.15 seconds)|19.89(36308 tokens/1825.26 seconds)|0.53|0.53|0.470|0.666|0.574|
|lora|qwen-7b-chat|ms-agent|2.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False|17.8913(0.2312%)|True|True|lr=5e-05/epoch=2|32.35GiB|0.95(87543 samples/91974.29 seconds)|18.11(2415 tokens/133.32 seconds)|0.53|1.01|0.462|0.676|0.304|
|qwen-7b-chat-eval|qwen-7b-chat|None|0.0|None||None(None)||||None||30.81(13765 tokens/446.83 seconds)|||**0.517**|0.679|0.568|
|rslora|qwen-7b-chat|ms-agent|2.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=True/use_dora=False|17.8913(0.2312%)|True|True|lr=5e-05/epoch=2|32.35GiB|0.94(87543 samples/92758.63 seconds)|18.87(2762 tokens/146.34 seconds)|**0.53**|0.99|0.451|0.679|0.339|
| full+lisa_2 | qwen-7b-chat | ms-agent | 2.0 | full | lisa_activated_layers=2/lisa_step_interval=20 | - | True | True | lr=5e-05/epoch=2 | 31.11GiB | 2.66(76837 samples/28881.28 seconds) | 36.10(134469 tokens/3725.21 seconds) | 0.62 | 1.06 | 0.349 | 0.653 | 0.592 |
| full+lisa_4 | qwen-7b-chat | ms-agent | 2.0 | full | lisa_activated_layers=4/lisa_step_interval=20 | - | True | True | lr=5e-05/epoch=2 | 31.87GiB | 2.63(76837 samples/29215.15 seconds) | 36.75(135477 tokens/3686.17 seconds) | 0.63 | 1.06 | 0.377 | 0.656 | **0.607** |
|lora+packing+ddp|qwen-7b-chat|ms-agent|2.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False/packing=True|17.8913(0.2312%)|True|True|lr=5e-05/epoch=2|35.65GiB*2|1.56(7900 samples/5057.30 seconds)|26.20(421094 tokens/16073.09 seconds)|0.63|0.98|0.473|0.664|0.552|
|lora+packing+lazytokenize|qwen-7b-chat|ms-agent|2.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False/packing=True|17.8913(0.2312%)|True|True|lr=5e-05/epoch=2|32.83GiB|7.69(78237 samples/10179.40 seconds)|25.86(307390 tokens/11888.17 seconds)|0.63|1.04|0.472|0.660|0.554|
|lora+packing|qwen-7b-chat|ms-agent|2.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False/packing=True|17.8913(0.2312%)|True|True|lr=5e-05/epoch=2|28.06GiB|0.79(7900 samples/10048.53 seconds)|26.12(409507 tokens/15675.36 seconds)|0.61|0.95|0.492|0.676|0.539|
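The rows above can be reproduced with `swift sft` by mapping the `tuner` and `tuner_params` columns onto CLI flags. A sketch for the plain `lora` row, assuming like-named flags throughout (in particular, `--train_dataset_mix_ratio` for the ms-bench mix ratio is an assumption):
```bash
# Sketch: the `lora` row (rank=8, target=ALL, alpha=32, lr=5e-05, epoch=2).
swift sft \
    --model_type qwen-7b-chat \
    --dataset ms-agent \
    --train_dataset_mix_ratio 2.0 \
    --sft_type lora \
    --lora_rank 8 \
    --lora_alpha 32 \
    --lora_target_modules ALL \
    --learning_rate 5e-5 \
    --num_train_epochs 2 \
    ...
```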
## unsloth
| exp_name | model_type | dataset | ms-bench mix ratio | tuner | tuner_params | trainable params(M) | flash_attn | gradient_checkpointing | hypers | memory | train speed(samples/s) | infer speed(tokens/s) | train_loss | eval_loss | gsm8k weighted acc | arc weighted acc | ceval weighted acc |
| --------------- | ------------------ | -------- | ------------------ | ----- | ------------ | ------------------- | ---------- | ---------------------- | ---------------- | -------- | ------------------------------------ | ------------------------------------- | ---------- | --------- | ------------------ | ---------------- | ------------------ |
| unsloth+lora+q4 | llama3-8b-instruct | ms-agent | 2.0 | lora | | 4.7186(0.1038%) | True | True | lr=5e-05/epoch=2 | 21.69GiB | 1.76(76839 samples/43763.01 seconds) | 15.22(160885 tokens/10570.90 seconds) | 0.58 | 1.03 | 0.668 | 0.755 | 0.501 |
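A sketch of the unsloth run, assuming the backend is selected via `--tuner_backend` and that "q4" refers to 4-bit quantization of the base weights:
```bash
swift sft \
    --model_type llama3-8b-instruct \
    --dataset ms-agent \
    --sft_type lora \
    --tuner_backend unsloth \
    --quantization_bit 4 \
    ...
```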
## Export
| exp_name | model_type | calibration dataset | quantization method | quantization bits | infer speed(tokens/s) | gsm8k weighted acc | arc weighted acc | ceval weighted acc |
| -------- | ---------- | ------------------- | ------------------- | ----------------- | --------------------- | ------------------ | ---------------- | ------------------ |
|awq-ms-bench-mini|qwen-7b-chat|ms-bench-mini|awq|4|27.25(16501 tokens/605.47 seconds)|0.494|0.665|0.571|
|awq-pileval|qwen-7b-chat|pileval|awq|4|26.92(12994 tokens/482.72 seconds)|**0.497**|**0.675**|**0.577**|
|gptq-ms-bench-mini|qwen-7b-chat|ms-bench-mini|gptq|4|31.16(15349 tokens/492.54 seconds)|0.482|0.642|0.556|
|gptq-pileval|qwen-7b-chat|pileval|gptq|4|31.67(15185 tokens/479.54 seconds)|0.478|0.654|0.559|
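A sketch of how one of these quantized exports might be produced, assuming `swift export` takes the calibration set via `--dataset` and the method and bit width via like-named flags:
```bash
# Sketch: the awq-pileval row.
swift export \
    --model_type qwen-7b-chat \
    --quant_method awq \
    --quant_bits 4 \
    --dataset pileval
```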
## AWQ
| exp_name | model_type | dataset | ms-bench mix ratio | tuner | tuner_params | trainable params(M) | flash_attn | gradient_checkpointing | hypers | memory | train speed(samples/s) | infer speed(tokens/s) | train_loss | eval_loss | gsm8k weighted acc | arc weighted acc | ceval weighted acc |
| -------- | ---------- | ------- | -------------------| ----- | ------------ | ------------------- | -----------| ---------------------- | ------ | ------ | ---------------------- | --------------------- | ---------- | --------- | ------------------ | ---------------- | ------------------ |
|qwen1half-7b-chat-awq|qwen1half-7b-chat-awq|ms-agent|2.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False|19.9885(1.5802%)|True|True|lr=5e-05/epoch=2|24.26GiB|0.45(87543 samples/194746.58 seconds)|16.08(2469 tokens/153.58 seconds)|**0.55**|**1.19**|**0.505**|**0.737**|**0.656**|
## AQLM
| exp_name | model_type | dataset | ms-bench mix ratio | tuner | tuner_params | trainable params(M) | flash_attn | gradient_checkpointing | hypers | memory | train speed(samples/s) | infer speed(tokens/s) | train_loss | eval_loss | gsm8k weighted acc | arc weighted acc | ceval weighted acc |
| -------- | ---------- | ------- | -------------------| ----- | ------------ | ------------------- | -----------| ---------------------- | ------ | ------ | ---------------------- | --------------------- | ---------- | --------- | ------------------ | ---------------- | ------------------ |
|llama2-7b-aqlm-2bit-1x16|llama2-7b-aqlm-2bit-1x16|dureader-robust-zh|0.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False|19.9885(1.6510%)|True|True|lr=5e-05/epoch=2|4.04GiB|0.17(14994 samples/86140.71 seconds)||**0.48**|**0.74**||||
## Sequence Parallel
| Model | Dataset | Hyper params | Total steps | Train speed | GPU memory |
| ----- | ------- | ------------ | ----------- | ----------- | ---------- |
| chatglm3-6b-32k | long-alpaca-12k (8055 tokens * 12000 rows) | gpu=2/sequence_parallel_size=1 (dual-GPU DDP baseline) | 5940 | 0.30 iter/s (5h13min total) | 27G*2 |
| | | gpu=2/sequence_parallel_size=2 (dual GPU, sequence parallel degree 2) | 11880 | 0.5 iter/s (6h total) | 20G*2 |
| | | gpu=4/sequence_parallel_size=4 (quad GPU, sequence parallel degree 4) | 11880 | 1 iter/s (3h20min total) | 18G*4 |
| | | gpu=4/sequence_parallel_size=2 (quad GPU, sequence parallel degree 2) | 5940 | 0.45 iter/s (3h total) | 21G*4 |
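Total steps double whenever the sequence-parallel degree equals the GPU count, which is consistent with sequence parallelism lowering the data-parallel degree: each group of `sequence_parallel_size` GPUs processes one sample's tokens jointly. A sketch of the second row, assuming DDP is launched via the `NPROC_PER_NODE` environment variable and the degree is set with the like-named `--sequence_parallel_size` flag:
```bash
# Sketch: 2 GPUs, sequence parallel degree 2 (second row above).
NPROC_PER_NODE=2 \
swift sft \
    --model_type chatglm3-6b-32k \
    --dataset long-alpaca-12k \
    --sequence_parallel_size 2 \
    ...
```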