# 0.8.5
1. Offline inference (customize the parameters as needed)
benchmark_throughput_0.8.5.py
Using the script below avoids repeatedly reloading the model when running inference with different parameters.
batch, prompt_tokens and completion_tokens can each be passed as a space-separated string.
All other arguments are the same as in the standard script.
<pre>
export HIP_VISIBLE_DEVICES=1
tp=1
model_path=/llm-models/qwen1.5/Qwen1.5-0.5B-Chat

batch="1 2"
prompt_tokens="16 64"
completion_tokens="128 256"
python benchmark_throughput_0.8.5.py --model ${model_path} --tensor-parallel-size ${tp} --num-prompts ${batch} --input-len ${prompt_tokens} --output-len ${completion_tokens} \
    --dtype float16  --trust-remote-code --max-model-len 32768 --output-json ./test_0.5B-0.8.5.txt
</pre>

With the arguments passed as above, the benchmarked scenarios are:

| bs | input | output |
|----|-------|--------|
| 1  | 16    | 128    |
| 1  | 64    | 256    |
| 2  | 16    | 128    |
| 2  | 64    | 256    |
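Judging from the table above, prompt_tokens and completion_tokens are paired positionally and each pair is run at every batch size. A minimal shell sketch of that expansion (an illustration of the assumed behaviour, not the actual logic of benchmark_throughput_0.8.5.py):
<pre>
# Reproduce the scenario list from the space-separated argument strings:
# each batch size is combined with the positionally paired (input, output) lengths.
batch="1 2"
prompt_tokens="16 64"
completion_tokens="128 256"

in_lens=(${prompt_tokens})
out_lens=(${completion_tokens})
for b in ${batch}; do
    for i in "${!in_lens[@]}"; do
        echo "${b}_${in_lens[$i]}_${out_lens[$i]}"   # 1_16_128 1_64_256 2_16_128 2_64_256
    done
done
</pre>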

The results are summarized in the file given by --output-json (./test_0.5B-0.8.5.txt here), for example:

<pre>
bs_in_out,elapsed_time,Throughput,total_tokens,output_tokens,ttft_mean,ttft_median,ttft_p99,tpop_mean,tpop_median,tpop_p99,output_token_throughput_mean,output_token_throughput_median,output_token_throughput_p99,inout_token_throughput_mean,inout_token_throughput_median,inout_token_throughput_p99
1_16_128,3.49,0.29,41.26,36.68,0.03801,0.03801,0.03801,0.0269,0.02691,0.02691,37.04,37.04,37.04,41.66,41.66,41.66
1_64_256,7.14,0.14,44.82,35.85,0.0291,0.0291,0.0291,0.0278,0.02776,0.02776,36.01,36.01,36.01,45.01,45.01,45.01
2_16_128,3.62,0.55,79.56,70.72,0.04829,0.04829,0.04893,0.028,0.02801,0.02801,35.51,35.51,35.51,39.94,39.94,39.95
2_64_256,7.31,0.27,87.55,70.04,0.04697,0.04697,0.04764,0.0284,0.02836,0.02836,35.17,35.17,35.18,43.97,43.97,43.97
</pre>
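For a quick look at the summary, the example above is plain comma-separated text despite the --output-json flag name and the .txt extension, so the usual shell tools work on it (a small sketch; field numbers follow the header shown above):
<pre>
# Print the scenario, throughput and mean TTFT columns from the summary file
# (fields 1, 3 and 6 in the header above).
cut -d, -f1,3,6 ./test_0.5B-0.8.5.txt | column -s, -t
</pre>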

2. Server inference
First run bash server.sh; once the server is up, run bash test.sh. Adjust the test parameters as needed.
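The "wait until the server is up" step can be automated with a small polling loop. A minimal sketch, assuming the server listens on the default localhost:8000 and exposes vLLM's /health endpoint; adjust the address to match server.sh:
<pre>
# After starting server.sh (e.g. in another terminal), poll the health
# endpoint until the server answers, then launch the test script.
until curl -sf http://localhost:8000/health > /dev/null; do
    sleep 5
done
bash test.sh
</pre>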

# 0.7.2
1. Offline inference
benchmark_throughput_0.7.2.py
Using the script below avoids repeatedly reloading the model when running inference with different parameters.
batch, prompt_tokens and completion_tokens can each be passed as a space-separated string.
All other arguments are the same as in the standard script.
<pre>
export HIP_VISIBLE_DEVICES=1
tp=1
model_path=/llm-models/qwen1.5/Qwen1.5-0.5B-Chat

batch="1 2"
prompt_tokens="16 64"
completion_tokens="128 256"
python benchmark_throughput_0.7.2.py --model ${model_path} --tensor-parallel-size ${tp} --num-prompts ${batch} --input-len ${prompt_tokens} --output-len ${completion_tokens} \
    --dtype float16  --trust-remote-code --max-model-len 32768 --output-json ./test_0.5B-0.7.2.txt
</pre>

With the arguments passed as above, the benchmarked scenarios are:

| bs | input | output |
|----|-------|--------|
| 1  | 16    | 128    |
| 1  | 64    | 256    |
| 2  | 16    | 128    |
| 2  | 64    | 256    |

The results are summarized in the file given by --output-json (./test_0.5B-0.7.2.txt here), for example:

<pre>
bs_in_out,elapsed_time,Throughput,total_tokens,output_tokens,ttft_mean,ttft_median,ttft_p99,tpop_mean,tpop_median,tpop_p99,output_token_throughput_mean,output_token_throughput_median,output_token_throughput_p99,inout_token_throughput_mean,inout_token_throughput_median,inout_token_throughput_p99
1_16_128,3.49,0.29,41.26,36.68,0.03801,0.03801,0.03801,0.0269,0.02691,0.02691,37.04,37.04,37.04,41.66,41.66,41.66
1_64_256,7.14,0.14,44.82,35.85,0.0291,0.0291,0.0291,0.0278,0.02776,0.02776,36.01,36.01,36.01,45.01,45.01,45.01
2_16_128,3.62,0.55,79.56,70.72,0.04829,0.04829,0.04893,0.028,0.02801,0.02801,35.51,35.51,35.51,39.94,39.94,39.95
2_64_256,7.31,0.27,87.55,70.04,0.04697,0.04697,0.04764,0.0284,0.02836,0.02836,35.17,35.17,35.18,43.97,43.97,43.97
</pre>

2. Server inference
benchmark_servein_0.7.2.py
backend_request_func.py
This approach avoids the mismatch between the actual generated length and the specified length during server inference.


<pre>
# Start the server
vllm serve $MODEL_PATH  --trust-remote-code   --dtype $dtype --max-model-len $max_len -tp $tp  --gpu-memory-utilization 0.97

# Send requests
# Add other arguments such as --distributed-executor-backend ray as needed

python  benchmark_servein_0.7.2.py --backend vllm --ignore-eos  --dataset-name random --random-input-len  $input_len --random-output-len  $output_len --model $MODEL_PATH  --num-prompts $num_prompts --endpoint /v1/completions
</pre>
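Before running the benchmark it can be worth confirming that the server actually answers on the completions endpoint. A minimal check, assuming the server is on the default localhost:8000 and serves the model under its path name (adjust if --served-model-name or another port is used):
<pre>
# One-off request against the OpenAI-compatible completions endpoint.
curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"${MODEL_PATH}\", \"prompt\": \"Hello\", \"max_tokens\": 16}"
</pre>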


# 0.6.2
1. Offline inference
benchmark_throughput_0.6.2.py
Using the script below avoids repeatedly reloading the model when running inference with different parameters.
batch, prompt_tokens and completion_tokens can each be passed as a space-separated string.
All other arguments are the same as in the standard script.
<pre>
export HIP_VISIBLE_DEVICES=1
tp=1
model_path=/llm-models/qwen1.5/Qwen1.5-0.5B-Chat


batch="1 2"
prompt_tokens="16 64"
completion_tokens="128 256"
python benchmark_throughput_0.6.2.py --model ${model_path} --tensor-parallel-size ${tp} --num-prompts ${batch} --input-len ${prompt_tokens} --output-len ${completion_tokens} \
    --dtype float16  --trust-remote-code --max-model-len 32768 --output-json ./test_0.5B-0.6.2.txt
</pre>
With the arguments passed as above, the benchmarked scenarios are:

| bs | input | output |
|----|-------|--------|
| 1  | 16    | 128    |
| 1  | 64    | 256    |
| 2  | 16    | 128    |
| 2  | 64    | 256    |

The results are summarized in the file given by --output-json (./test_0.5B-0.6.2.txt here), for example:

<pre>
bs_in_out,elapsed_time,Throughput,total_tokens,output_tokens,ttft_mean,ttft_median,ttft_p99,tpop_mean,tpop_median,tpop_p99,output_token_throughput_mean,output_token_throughput_median,output_token_throughput_p99,inout_token_throughput_mean,inout_token_throughput_median,inout_token_throughput_p99
1_16_128,3.49,0.29,41.26,36.68,0.03801,0.03801,0.03801,0.0269,0.02691,0.02691,37.04,37.04,37.04,41.66,41.66,41.66
1_64_256,7.14,0.14,44.82,35.85,0.0291,0.0291,0.0291,0.0278,0.02776,0.02776,36.01,36.01,36.01,45.01,45.01,45.01
2_16_128,3.62,0.55,79.56,70.72,0.04829,0.04829,0.04893,0.028,0.02801,0.02801,35.51,35.51,35.51,39.94,39.94,39.95
2_64_256,7.31,0.27,87.55,70.04,0.04697,0.04697,0.04764,0.0284,0.02836,0.02836,35.17,35.17,35.18,43.97,43.97,43.97
</pre>

2. Server inference
benchmark_servein_0.6.2.py
backend_request_func.py
This approach reduces the gap between the server's generated length and the specified length.
<pre>
# Test with the provided script

# Start the server
vllm serve $MODEL_PATH  --trust-remote-code   --dtype $dtype --max-model-len $max_len -tp $tp  --gpu-memory-utilization 0.97

# Send requests
# Add other arguments such as --distributed-executor-backend ray as needed
# Usage is the same as usual, just add --ignore-eos
python  benchmark_servein_0.6.2.py --backend vllm --ignore-eos  --dataset-name random --random-input-len  $input_len --random-output-len  $output_len --model $MODEL_PATH  --num-prompts $num_prompts --endpoint /v1/completions
</pre>
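The server benchmark takes one input/output length per run, so sweeping several length combinations can be scripted with a small wrapper loop. A sketch using only the flags shown above; the length values and num_prompts are placeholder examples to adjust as needed:
<pre>
# Sweep several (input_len, output_len) combinations against a running server.
num_prompts=8   # example value
for input_len in 128 1024; do
    for output_len in 128 512; do
        python benchmark_servein_0.6.2.py --backend vllm --ignore-eos --dataset-name random \
            --random-input-len ${input_len} --random-output-len ${output_len} \
            --model ${MODEL_PATH} --num-prompts ${num_prompts} --endpoint /v1/completions
    done
done
</pre>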