# Benchmark
## Table of Contents
- [Parameter Settings](#parameter-settings)
- [Quantization](#quantization)
- [Model Type & Max Length](#model-type--max-length)
- [Batch Size](#batch-size)
- [Use Flash Attn & Gradient Checkpointing](#use-flash-attn--gradient-checkpointing)
- [LoRA Rank & LoRA Target Modules](#lora-rank--lora-target-modules)
- [Gradient Accumulation Steps](#gradient-accumulation-steps)
- [Tuners](#tuners)
- [Export](#export)
- [AWQ](#awq)
- [AQLM](#aqlm)
- [Sequence Parallel](#sequence-parallel)

## Parameter Settings

Experimental environment:
- A100
- CUDA 11.8
- python 3.10
- torch 2.1.1
- flash_attn 2.3.4
- xformers 0.0.23
- auto_gptq 0.5.1
- bitsandbytes 0.41.3.post2

The following command-line settings are shared by all experiments:

```bash
--dataset_test_ratio 0 \
--dataset cls-fudan-news-zh \
--save_strategy no \
--check_dataset_strategy warning \
--preprocess_num_proc 4 \
```

If a parameter is not specified explicitly, the following default values are used:

```bash
--max_length 2048 \
--batch_size 1 \
--gradient_checkpointing true \
--use_flash_attn true \
--lora_rank 8 \
--lora_target_modules DEFAULT \
--quantization_bit 0 \
--gradient_accumulation_steps 16 \
```

Token-count statistics of the test dataset (computed with qwen's tokenizer): 3234.4±2547.5, min=91, max=19548.

The scripts used for these experiments can be found under `scripts/benchmark/test_memory_time/`.

## Quantization

The test script is:

```bash
swift sft \
    --model_type {MODEL_TYPE} \
    --quantization_bit {QUANTIZATION_BIT} \
    --sft_type lora \
    ...
```
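As a concrete illustration, the template above combines with the shared and default settings into a single command along these lines (a sketch only; the actual benchmark scripts live under `scripts/benchmark/test_memory_time/`):

```bash
# Illustrative expansion for a 4-bit qwen-7b-chat LoRA run; the flag values
# are taken from the shared/default settings listed under "Parameter Settings".
swift sft \
    --model_type qwen-7b-chat \
    --quantization_bit 4 \
    --sft_type lora \
    --dataset cls-fudan-news-zh \
    --dataset_test_ratio 0 \
    --save_strategy no \
    --check_dataset_strategy warning \
    --preprocess_num_proc 4 \
    --max_length 2048 \
    --batch_size 1 \
    --gradient_checkpointing true \
    --use_flash_attn true \
    --lora_rank 8 \
    --lora_target_modules DEFAULT \
    --gradient_accumulation_steps 16
```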
| Model Type [LoRA] | Quantization | Training Speed (samples/s) | GPU Memory (GiB) |
| ----------------- | ------------ | -------------------------- | ---------------- |
| qwen-7b-chat | bf16 | 4.31 | 27.74 |
| | int4 (gptq) | 2.05 | 19.21 |
| | int8 (gptq) | 1.97 | 22.20 |
| | int4 (bnb) | 2.41 | 23.85 |
| qwen-14b-chat | bf16 | 2.60 | 40.14 |
| | int4 (gptq) | 1.15 | 23.30 |
| | int8 (gptq) | 1.08 | 29.13 |
| | int4 (bnb) | 1.36 | 30.05 |
| qwen-72b-chat | bf16 | 0.59 (2*A100) | 73.71+78.54 |
| | int4 (gptq) | 0.23 | 54.86 |
| | int8 (gptq) | 0.21 | 78.44 |
| | int4 (bnb) | 0.28 | 74.87 |
## Model Type & Max Length

### LoRA

The test script is:

```bash
swift sft \
    --model_type {MODEL_TYPE} \
    --max_length {MAX_LENGTH} \
    --sft_type lora \
    ...
```
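The rows below come from sweeping the `{MAX_LENGTH}` placeholder; a minimal sketch of such a sweep (illustrative only, not the original script):

```bash
# Illustrative sweep over max_length for one model; the remaining flags follow
# the shared/default settings from the "Parameter Settings" section.
for MAX_LENGTH in 512 1024 2048 4096 8192; do
    swift sft \
        --model_type qwen-7b-chat \
        --max_length $MAX_LENGTH \
        --sft_type lora \
        --dataset cls-fudan-news-zh \
        --dataset_test_ratio 0 \
        --save_strategy no \
        --check_dataset_strategy warning \
        --preprocess_num_proc 4
done
```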
| Model Type [LoRA] | Max Length | Training Speed (samples/s) | GPU Memory (GiB) |
| ----------------- | ---------- | -------------------------- | ---------------- |
| qwen-1_8b-chat | 512 | 9.88 | 6.99 |
| | 1024 | 9.90 | 10.71 |
| | 2048 | 8.77 | 16.35 |
| | 4096 | 5.92 | 23.80 |
| | 8192 | 4.19 | 37.03 |
| qwen-7b-chat | 512 | 7.43 | 18.01 |
| | 1024 | 6.51 | 21.73 |
| | 2048 | 4.31 | 27.74 |
| | 4096 | 2.05 | 35.31 |
| | 8192 | 1.34 | 48.41 |
| qwen-14b-chat | 512 | 5.63 | 30.14 |
| | 1024 | 4.36 | 34.43 |
| | 2048 | 2.60 | 40.14 |
| | 4096 | 1.17 | 47.95 |
| | 8192 | 0.79 | 60.74 |
| qwen-72b-chat (2*A100) | 512 | 1.41 | 67.68+73.07 |
| | 1024 | 1.02 | 70.25+77.11 |
| | 2048 | 0.59 | 73.71+78.54 |
| | 4096 | - | OOM |
| | 8192 | - | OOM |
| chatglm3-6b | 512 | 6.72 | 13.94 |
| | 1024 | 6.16 | 12.99 |
| | 2048 | 4.20 | 17.20 |
| | 4096 | 1.92 | 29.80 |
| | 8192 | 1.24 | 66.82 |
| yi-6b-chat | 512 | 5.27 | 13.72 |
| | 1024 | 5.07 | 15.44 |
| | 2048 | 3.84 | 16.95 |
| | 4096 | 1.99 | 28.25 |
| | 8192 | 1.35 | 43.81 |
| yi-34b-chat | 512 | 2.32 | 66.72 |
| | 1024 | 1.76 | 69.10 |
| | 2048 | 1.05 | 71.34 |
| | 4096 | 0.47 | 78.72 |
| | 8192 | 0.31 (2*A100) | 47.01+65.03 |
| openbuddy-zephyr-7b-chat | 512 | 5.17 | 14.99 |
| | 1024 | 3.92 | 16.57 |
| | 2048 | 3.08 | 19.89 |
| | 4096 | 1.85 | 23.29 |
| | 8192 | 0.92 | 52.14 |
| baichuan2-7b-chat | 512 | 6.09 | 18.18 |
| | 1024 | 5.36 | 17.45 |
| | 2048 | 3.43 | 19.18 |
| | 4096 | 1.69 | 34.22 |
| | 8192 | 1.16 | 45.47 |
| baichuan2-13b-chat | 512 | 5.32 | 31.01 |
| | 1024 | 3.91 | 31.58 |
| | 2048 | 1.77 | 32.40 |
| | 4096 | 0.65 | 49.63 |
| | 8192 | 0.36 | 76.17 |
### Full

The test script is:

```bash
swift sft \
    --model_type {MODEL_TYPE} \
    --max_length {MAX_LENGTH} \
    --sft_type full \
    ...
```
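For the rows marked 2*A100, the run is presumably launched with two GPUs visible so that the model can be placed across both devices. A hedged sketch (the exact launch commands are not given in this document, and the multi-GPU placement behavior is assumed):

```bash
# Hypothetical two-GPU launch for the 2*A100 rows; assumes that exposing two
# GPUs lets swift shard the model across them. Remaining flags follow the
# shared/default settings from the "Parameter Settings" section.
CUDA_VISIBLE_DEVICES=0,1 \
swift sft \
    --model_type qwen-14b-chat \
    --max_length 2048 \
    --sft_type full \
    --dataset cls-fudan-news-zh \
    --dataset_test_ratio 0 \
    --save_strategy no \
    --check_dataset_strategy warning \
    --preprocess_num_proc 4
```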
| Model Type [FULL] | Max Length | Training Speed (samples/s) | GPU Memory (GiB) |
| ----------------- | ---------- | -------------------------- | ---------------- |
| qwen-1_8b-chat | 512 | 10.77 | 18.16 |
| | 1024 | 10.39 | 18.62 |
| | 2048 | 8.73 | 35.11 |
| | 4096 | 5.45 | 31.62 |
| | 8192 | 3.81 | 38.93 |
| qwen-7b-chat | 512 | 5.96 | 73.37 |
| | 1024 | 5.00 | 73.64 |
| | 2048 | 3.30 | 74.26 |
| | 4096 | 1.64 | 78.76 |
| | 8192 | 1.11 (2*A100) | 61.34+73.00 |
| qwen-14b-chat (2*A100) | 512 | 3.66 | 60.42+72.31 |
| | 1024 | 2.98 | 60.61+74.37 |
| | 2048 | 1.93 | 60.70+78.22 |
| | 4096 | 0.92 | 75.59+78.64 |
| | 8192 | 0.62 | 76.59+77.68 |
## Batch Size

The test script is:

```bash
swift sft \
    --batch_size {BATCH_SIZE} \
    --model_type qwen-7b-chat \
    --sft_type lora \
    ...
```
| Model Type [LoRA] | Batch Size | Training Speed (samples/s) | GPU Memory (GiB) |
| ----------------- | ---------- | -------------------------- | ---------------- |
| qwen-7b-chat | 1 | 4.31 | 27.74 |
| | 2 | 3.60 | 43.11 |
| | 4 | 3.02 | 63.81 |
| | 8 | 2.77 | 76.14 |
## Use Flash Attn & Gradient Checkpointing

The test script is:

```bash
swift sft \
    --use_flash_attn {USE_FLASH_ATTN} \
    --gradient_checkpointing {GRADIENT_CHECKPOINTING} \
    --model_type qwen-7b-chat \
    --sft_type lora \
    ...
```
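For example, both boolean flags can be switched off to measure their individual effect (illustrative command; the other settings stay at their defaults):

```bash
# Illustrative: disable both flash attention and gradient checkpointing;
# remaining flags follow the shared/default settings above.
swift sft \
    --use_flash_attn false \
    --gradient_checkpointing false \
    --model_type qwen-7b-chat \
    --sft_type lora \
    --dataset cls-fudan-news-zh \
    --dataset_test_ratio 0 \
    --save_strategy no \
    --check_dataset_strategy warning \
    --preprocess_num_proc 4
```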
| Model Type [LoRA] | Use Flash Attn | Gradient Checkpointing | Training Speed (samples/s) | GPU Memory (GiB) |
| ----------------- | -------------- | ---------------------- | -------------------------- | ---------------- |
| qwen-7b-chat | ✓ | ✓ | 4.31 | 27.74 |
| | ✓ | ✗ | 6.19 | 37.70 |
| | ✗ | ✓ | 3.13 | 27.71 |
| | ✗ | ✗ | 4.45 | 57.67 |
## LoRA Rank & LoRA Target Modules

The test script is:

```bash
swift sft \
    --lora_rank {LORA_RANK} \
    --lora_target_modules {LORA_TARGET_MODULES} \
    --model_type qwen-7b-chat \
    --sft_type lora \
    ...
```
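An illustrative combination of a larger rank with all linear layers as targets (this exact pairing is not one of the measured rows below; other settings stay at their defaults):

```bash
# Illustrative: rank-64 LoRA applied to all linear layers; remaining flags
# follow the shared/default settings above.
swift sft \
    --lora_rank 64 \
    --lora_target_modules ALL \
    --model_type qwen-7b-chat \
    --sft_type lora \
    --dataset cls-fudan-news-zh \
    --dataset_test_ratio 0 \
    --save_strategy no \
    --check_dataset_strategy warning \
    --preprocess_num_proc 4
```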
| Model Type [LoRA] | LoRA Rank | LoRA Target Modules | Training Speed (samples/s) | GPU Memory (GiB) | Trainable Params (M) |
| ----------------- | --------- | ------------------- | -------------------------- | ---------------- | -------------------- |
| qwen-7b-chat | 2 | DEFAULT (c_attn) | 4.27 | 27.72 | 1.05 |
| | 8 | DEFAULT | 4.31 | 27.74 | 4.19 |
| | 64 | DEFAULT | 4.19 | 27.85 | 33.55 |
| | 8 | ALL (all linear) | 3.22 | 27.87 | 17.89 |
## Gradient Accumulation Steps

The test script is:

```bash
swift sft \
    --gradient_accumulation_steps {GRADIENT_ACCUMULATION_STEPS} \
    --model_type qwen-7b-chat \
    --sft_type lora \
    ...
```
| Model Type [LoRA] | Gradient Accumulation Steps | Training Speed (samples/s) | GPU Memory (GiB) |
| ----------------- | --------------------------- | -------------------------- | ---------------- |
| qwen-7b-chat | 1 | 4.26 | 27.73 |
| | 2 | 4.32 | 27.74 |
| | 4 | 4.31 | 27.74 |
| | 8 | 4.32 | 27.74 |
| | 16 | 4.33 | 27.74 |
| | 32 | 4.30 | 27.74 |
| | 64 | 4.32 | 27.74 |

Gradient accumulation only changes how often the optimizer steps (effective batch size = batch_size × gradient_accumulation_steps, e.g. 1 × 16 = 16); the per-micro-batch forward/backward pass is unchanged, which is why throughput and memory stay essentially constant across this sweep.
## Tuners

| exp_name | model_type | dataset | ms-bench mix ratio | tuner | tuner_params | trainable params(M) | flash_attn | gradient_checkpointing | hypers | memory | train speed(samples/s) | infer speed(tokens/s) | train_loss | eval_loss | gsm8k weighted acc | arc weighted acc | ceval weighted acc |
| -------- | ---------- | ------- | -------------------| ----- | ------------ | ------------------- | -----------| ---------------------- | ------ | ------ | ---------------------- | --------------------- | ---------- | --------- | ------------------ | ---------------- | ------------------ |
|adalora|qwen-7b-chat|ms-agent|2.0|adalora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False|26.8389(0.3464%)|True|True|lr=5e-05/epoch=2|32.55GiB|0.92(87543 samples/95338.71 seconds)|17.33(2345 tokens/135.29 seconds)|0.57|1.07|0.391|0.665|0.569|
|adapter|qwen-7b-chat|ms-agent|2.0|adapter||33.6896(0.4344%)|True|True|lr=5e-05/epoch=2|32.19GiB|1.48(87543 samples/59067.71 seconds)|26.63(4019 tokens/150.90 seconds)|0.55|1.03|0.438|0.662|0.565|
|dora|qwen-7b-chat|ms-agent|2.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=True|19.2512(0.2487%)|True|True|lr=5e-05/epoch=2|32.46GiB|0.51(87543 samples/171110.54 seconds)|4.29(2413 tokens/562.32 seconds)|0.53|1.01|0.466|0.683|**0.577**|
|full+galore128|qwen-7b-chat|ms-agent|2.0|full|galore_rank=128/galore_per_parameter=false/galore_with_embedding=false|7721.3245(100.0000%)|True|True|lr=5e-05/epoch=2|47.02GiB|1.10(87543 samples/79481.96 seconds)|28.96(2400 tokens/82.88 seconds)|0.55|1.00|0.358|**0.688**|**0.577**|
|full+galore32|qwen-7b-chat|ms-agent|2.0|full|galore_rank=32/galore_per_parameter=false/galore_with_embedding=false|7721.3245(100.0000%)|True|True|lr=5e-05/epoch=2|47.05GiB|1.11(87543 samples/78989.74 seconds)|29.17(2431 tokens/83.35 seconds)|0.56|1.01|0.386|0.667|0.539|
|full+galore64|qwen-7b-chat|ms-agent|2.0|full|galore_rank=64/galore_per_parameter=false/galore_with_embedding=false|7721.3245(100.0000%)|True|True|lr=5e-05/epoch=2|46.91GiB|1.11(87543 samples/79200.36 seconds)|28.94(2448 tokens/84.60 seconds)|0.56|1.01|0.397|0.674|0.544|
|full+galore_emb|qwen-7b-chat|ms-agent|2.0|full|galore_rank=128/galore_per_parameter=false/galore_with_embedding=true|7721.3245(100.0000%)|True|True|lr=5e-05/epoch=2|44.53GiB|1.10(87543 samples/79775.02 seconds)|29.45(2433 tokens/82.62 seconds)|0.55|1.00|0.398|0.670|0.568|
|full+galore_perparam|qwen-7b-chat|ms-agent|2.0|full|galore_rank=128/galore_per_parameter=true/galore_with_embedding=false|7721.3245(100.0000%)|True|True|lr=5e-05/epoch=2|47.02GiB|1.25(87543 samples/69821.89 seconds)|29.02(2478 tokens/85.39 seconds)|0.54|1.00|0.372|0.669|0.524|
|full+no_mix|qwen-7b-chat|ms-agent|0.0|full||7721.3245(100.0000%)|True|True|lr=5e-05/epoch=2|72.56GiB|1.27(29698 samples/23356.97 seconds)|30.31(11738 tokens/387.29 seconds)|0.57|**0.44**|0.174|0.652|0.553|
|full|qwen-7b-chat|ms-agent|2.0|full||7721.3245(100.0000%)|True|True|lr=5e-05/epoch=2|73.53GiB|1.43(87543 samples/61022.97 seconds)|29.51(3382 tokens/114.62 seconds)|0.54|0.95|0.343|0.536|0.495|
|llamapro|qwen-7b-chat|ms-agent|2.0|llamapro|num_blocks=4|809.5826(9.4900%)|True|True|lr=5e-05/epoch=2|38.11GiB|1.53(87543 samples/57294.42 seconds)|25.80(2374 tokens/92.02 seconds)|0.53|1.00|0.434|0.645|0.357|
|lora+|qwen-7b-chat|ms-agent|2.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=16.0/use_rslora=False/use_dora=False|17.8913(0.2312%)|True|True|lr=5e-05/epoch=2|32.35GiB|0.95(87543 samples/91923.80 seconds)|18.81(3329 tokens/176.94 seconds)|0.53|0.98|0.432|0.647|0.344|
|lora+neftune|qwen-7b-chat|ms-agent|2.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False/neftune_noise_alpha=15.0|17.8913(0.2312%)|True|True|lr=5e-05/epoch=2|32.35GiB|0.96(87543 samples/91525.50 seconds)|19.84(161792 tokens/8156.02 seconds)|0.53|1.02|0.456|0.671|0.401|
|lora+no_mix|qwen-7b-chat|ms-agent|0.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False|17.8913(0.2312%)|True|True|lr=5e-05/epoch=2|30.86GiB|0.91(29698 samples/32570.15 seconds)|19.89(36308 tokens/1825.26 seconds)|0.53|0.53|0.470|0.666|0.574|
|lora|qwen-7b-chat|ms-agent|2.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False|17.8913(0.2312%)|True|True|lr=5e-05/epoch=2|32.35GiB|0.95(87543 samples/91974.29 seconds)|18.11(2415 tokens/133.32 seconds)|0.53|1.01|0.462|0.676|0.304|
|qwen-7b-chat-eval|qwen-7b-chat|None|0.0|None||None(None)||||None||30.81(13765 tokens/446.83 seconds)|||**0.517**|0.679|0.568|
|rslora|qwen-7b-chat|ms-agent|2.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=True/use_dora=False|17.8913(0.2312%)|True|True|lr=5e-05/epoch=2|32.35GiB|0.94(87543 samples/92758.63 seconds)|18.87(2762 tokens/146.34 seconds)|**0.53**|0.99|0.451|0.679|0.339|
| full+lisa_2 | qwen-7b-chat | ms-agent | 2.0 | full | lisa_activated_layers=2/lisa_step_interval=20 | - | True | True | lr=5e-05/epoch=2 | 31.11GiB | 2.66(76837 samples/28881.28 seconds) | 36.10(134469 tokens/3725.21 seconds) | 0.62 | 1.06 | 0.349 | 0.653 | 0.592 |
| full+lisa_4 | qwen-7b-chat | ms-agent | 2.0 | full | lisa_activated_layers=4/lisa_step_interval=20 | - | True | True | lr=5e-05/epoch=2 | 31.87GiB | 2.63(76837 samples/29215.15 seconds) | 36.75(135477 tokens/3686.17 seconds) | 0.63 | 1.06 | 0.377 | 0.656 | **0.607** |
|lora+packing+ddp|qwen-7b-chat|ms-agent|2.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False/packing=True|17.8913(0.2312%)|True|True|lr=5e-05/epoch=2|35.65GiB*2|1.56(7900 samples/5057.30 seconds)|26.20(421094 tokens/16073.09 seconds)|0.63|0.98|0.473|0.664|0.552|
|lora+packing+lazytokenize|qwen-7b-chat|ms-agent|2.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False/packing=True|17.8913(0.2312%)|True|True|lr=5e-05/epoch=2|32.83GiB|7.69(78237 samples/10179.40 seconds)|25.86(307390 tokens/11888.17 seconds)|0.63|1.04|0.472|0.660|0.554|
|lora+packing|qwen-7b-chat|ms-agent|2.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False/packing=True|17.8913(0.2312%)|True|True|lr=5e-05/epoch=2|28.06GiB|0.79(7900 samples/10048.53 seconds)|26.12(409507 tokens/15675.36 seconds)|0.61|0.95|0.492|0.676|0.539|

## unsloth

| exp_name | model_type | dataset | ms-bench mix ratio | tuner | tuner_params | trainable params(M) | flash_attn | gradient_checkpointing | hypers | memory | train speed(samples/s) | infer speed(tokens/s) | train_loss | eval_loss | gsm8k weighted acc | arc weighted acc | ceval weighted acc |
| --------------- | ------------------ | -------- | ------------------ | ----- | ------------ | ------------------- | ---------- | ---------------------- | ---------------- | -------- | ------------------------------------ | ------------------------------------- | ---------- | --------- | ------------------ | ---------------- | ------------------ |
| unsloth+lora+q4 | llama3-8b-instruct | ms-agent | 2.0 | lora | | 4.7186(0.1038%) | True | True | lr=5e-05/epoch=2 | 21.69GiB | 1.76(76839 samples/43763.01 seconds) | 15.22(160885 tokens/10570.90 seconds) | 0.58 | 1.03 | 0.668 | 0.755 | 0.501 |

## Export

| exp_name | model_type | calibration dataset | quantization method | quantization bits | infer speed(tokens/s) | gsm8k weighted acc | arc weighted acc | ceval weighted acc |
| -------- | ---------- | ------------------- | ------------------- | ----------------- | --------------------- | ------------------ | ---------------- | ------------------ |
|awq-ms-bench-mini|qwen-7b-chat|ms-bench-mini|awq|4|27.25(16501 tokens/605.47 seconds)|0.494|0.665|0.571|
|awq-pileval|qwen-7b-chat|pileval|awq|4|26.92(12994 tokens/482.72 seconds)|**0.497**|**0.675**|**0.577**|
|gptq-ms-bench-mini|qwen-7b-chat|ms-bench-mini|gptq|4|31.16(15349 tokens/492.54 seconds)|0.482|0.642|0.556|
|gptq-pileval|qwen-7b-chat|pileval|gptq|4|31.67(15185 tokens/479.54 seconds)|0.478|0.654|0.559|

## AWQ

| exp_name | model_type | dataset | ms-bench mix ratio | tuner | tuner_params | trainable params(M) | flash_attn | gradient_checkpointing | hypers | memory | train speed(samples/s) | infer speed(tokens/s) | train_loss | eval_loss | gsm8k weighted acc | arc weighted acc | ceval weighted acc |
| -------- | ---------- | ------- | -------------------| ----- | ------------ | ------------------- | -----------| ---------------------- | ------ | ------ | ---------------------- | --------------------- | ---------- | --------- | ------------------ | ---------------- | ------------------ |
|qwen1half-7b-chat-awq|qwen1half-7b-chat-awq|ms-agent|2.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False|19.9885(1.5802%)|True|True|lr=5e-05/epoch=2|24.26GiB|0.45(87543 samples/194746.58 seconds)|16.08(2469 tokens/153.58 seconds)|**0.55**|**1.19**|**0.505**|**0.737**|**0.656**|

## AQLM

| exp_name | model_type | dataset | ms-bench mix ratio | tuner | tuner_params | trainable params(M) | flash_attn | gradient_checkpointing | hypers | memory | train speed(samples/s) | infer speed(tokens/s) | train_loss | eval_loss | gsm8k weighted acc | arc weighted acc | ceval weighted acc |
| -------- | ---------- | ------- | -------------------| ----- | ------------ | ------------------- | -----------| ---------------------- | ------ | ------ | ---------------------- | --------------------- | ---------- | --------- | ------------------ | ---------------- | ------------------ |
|llama2-7b-aqlm-2bit-1x16|llama2-7b-aqlm-2bit-1x16|dureader-robust-zh|0.0|lora|rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False|19.9885(1.6510%)|True|True|lr=5e-05/epoch=2|4.04GiB|0.17(14994 samples/86140.71 seconds)||**0.48**|**0.74**||||

## Sequence Parallel
| Model | Dataset | Hyper params | Total steps | Train speed | GPU memory |
| ----- | ------- | ------------ | ----------- | ----------- | ---------- |
| chatglm3-6b-32k | long-alpaca-12k (8055 tokens * 12000 rows) | gpu=2/sequence_parallel_size=1 (2 GPUs, DDP baseline) | 5940 | 0.30 iter/s (5h13min total) | 27G*2 |
| | | gpu=2/sequence_parallel_size=2 (2 GPUs, sequence parallel size 2) | 11880 | 0.5 iter/s (6h total) | 20G*2 |
| | | gpu=4/sequence_parallel_size=4 (4 GPUs, sequence parallel size 4) | 11880 | 1 iter/s (3h20min total) | 18G*4 |
| | | gpu=4/sequence_parallel_size=2 (4 GPUs, sequence parallel size 2) | 5940 | 0.45 iter/s (3h total) | 21G*4 |
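A hedged sketch of how the `gpu=2/sequence_parallel_size=2` row might be launched, assuming `--sequence_parallel_size` is exposed as a CLI flag (as the hyperparameter name in the table suggests) and that `NPROC_PER_NODE` starts a two-process distributed run:

```bash
# Hypothetical launch for the gpu=2 / sequence_parallel_size=2 row; the flag
# and environment-variable names are assumptions based on the table's
# hyperparameters, not commands taken from this document.
NPROC_PER_NODE=2 CUDA_VISIBLE_DEVICES=0,1 \
swift sft \
    --model_type chatglm3-6b-32k \
    --dataset long-alpaca-12k \
    --sft_type lora \
    --sequence_parallel_size 2
```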