# 0.7.2
## 1. Offline inference
Script: `benchmark_throughput_0.7.2.py`

Using the script below avoids repeatedly reloading the model when benchmarking different parameter combinations. `batch`, `prompt_tokens`, and `completion_tokens` can each be passed as a space-separated string; all other arguments are the same as in the standard script.
```bash
export HIP_VISIBLE_DEVICES=1
tp=1
model_path=/llm-models/qwen1.5/Qwen1.5-0.5B-Chat

batch="1 2"
prompt_tokens="16 64"
completion_tokens="128 256"
python benchmark_throughput_0.7.2.py --model ${model_path} --tensor-parallel-size ${tp} --num-prompts ${batch} --input-len ${prompt_tokens} --output-len ${completion_tokens} \
    --dtype float16  --trust-remote-code --max-model-len 32768 --output-json ./test_0.5B-0.7.2.txt
```

With the arguments passed as above, the benchmarked scenarios are (the expansion is sketched after the table):

| bs | input | output |
|----|-------|--------|
| 1  | 16    | 128    |
| 1  | 64    | 256    |
| 2  | 16    | 128    |
| 2  | 64    | 256    |

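For reference, the space-separated values expand as the table shows: each batch size is combined with the `(input, output)` pairs taken positionally. A minimal sketch of that expansion (an illustration only, not the benchmark script's actual code):

```python
from itertools import product

batch = "1 2".split()
prompt_tokens = "16 64".split()
completion_tokens = "128 256".split()

# Each batch size is combined with the (input, output) pairs taken
# positionally, which reproduces the scenario table above.
for bs, (inp, out) in product(batch, zip(prompt_tokens, completion_tokens)):
    print(f"bs={bs} input={inp} output={out}")
```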

The results are summarized in the file passed to `--output-json` (here `./test_0.5B-0.7.2.txt`), for example:

```
bs_in_out,elapsed_time,Throughput,total_tokens,output_tokens,ttft_mean,ttft_median,ttft_p99,tpop_mean,tpop_median,tpop_p99,output_token_throughput_mean,output_token_throughput_median,output_token_throughput_p99,inout_token_throughput_mean,inout_token_throughput_median,inout_token_throughput_p99
1_16_128,3.49,0.29,41.26,36.68,0.03801,0.03801,0.03801,0.0269,0.02691,0.02691,37.04,37.04,37.04,41.66,41.66,41.66
1_64_256,7.14,0.14,44.82,35.85,0.0291,0.0291,0.0291,0.0278,0.02776,0.02776,36.01,36.01,36.01,45.01,45.01,45.01
2_16_128,3.62,0.55,79.56,70.72,0.04829,0.04829,0.04893,0.028,0.02801,0.02801,35.51,35.51,35.51,39.94,39.94,39.95
2_64_256,7.31,0.27,87.55,70.04,0.04697,0.04697,0.04764,0.0284,0.02836,0.02836,35.17,35.17,35.18,43.97,43.97,43.97

```
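Despite the `.txt` extension the summary is plain CSV, so it can be post-processed directly. A minimal sketch, assuming the layout shown above and the file name passed to `--output-json`:

```python
import csv

# Read the benchmark summary written via --output-json and print a few
# headline numbers per scenario.
with open("test_0.5B-0.7.2.txt", newline="") as f:
    for row in csv.DictReader(f):
        print(
            f"{row['bs_in_out']}: "
            f"output tok/s (mean) = {row['output_token_throughput_mean']}, "
            f"ttft mean = {row['ttft_mean']} s"
        )
```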
## 2. Server inference
Scripts: `benchmark_servein_0.7.2.py`, `backend_request_func.py`

This approach avoids the problem of the actual generated length not matching the specified length during server inference.

```bash
# Test with the provided scripts

# Start the server
vllm serve $MODEL_PATH  --trust-remote-code   --dtype $dtype --max-model-len $max_len -tp $tp  --gpu-memory-utilization 0.97
```



```bash
# Send requests
# Add other arguments such as --distributed-executor-backend ray as needed
# Usage is the same as usual, except that --ignore-eos must be added
python  benchmark_servein_0.7.2.py --backend vllm --ignore-eos  --dataset-name random --random-input-len  $input_len --random-output-len  $output_len --model $MODEL_PATH  --num-prompts $num_prompts --endpoint /v1/completions
```
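Before running the full benchmark it can be worth confirming that the server answers on the completions endpoint. A minimal sanity check, assuming vLLM's OpenAI-compatible server on its default `localhost:8000` (adjust host, port, and model path to your setup):

```python
import requests

# Send one small completion request to the same endpoint the benchmark hits.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "/llm-models/qwen1.5/Qwen1.5-0.5B-Chat",  # same value as $MODEL_PATH
        "prompt": "Hello",
        "max_tokens": 16,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```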
## 3. prof

### offline_prof

#### hipprof

Scripts: `prof.py`, `benchmark_throughput_0.7.2_hipprof.py`

```bash
# Usage example
# The yellow-highlighted parts are the additions on top of the standard command
SGLANG_PROF_ROCTX=1 hipprof --trace-off python benchmark_throughput_0.7.2_hipprof.py --num-prompts 1  --input-len 2000  --output-len 1 --model  /models/Llama-2-7b-hf  --trust-remote-code  --enforce-eager --dtype float16 > 7b-prefill-2000-test.log 2>&1
```
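Presumably `SGLANG_PROF_ROCTX=1` switches on the roctx range markers that `prof.py` injects so hipprof can attribute time to named regions; that is an assumption based on the variable name. If you want to add extra markers of your own, PyTorch's stock nvtx wrapper can be used; whether it surfaces as roctx ranges on a ROCm/DCU build is also an assumption worth verifying on your stack:

```python
import torch

# Hypothetical helper (not part of prof.py): wrap a region of interest in an
# nvtx range so it can show up as a named span in the profiler timeline.
def traced(name, fn):
    torch.cuda.nvtx.range_push(name)
    try:
        return fn()
    finally:
        torch.cuda.nvtx.range_pop()
```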
#### torchprof

Script: `benchmark_throughput_0.7.2_torchprof.py`

```bash
# Launched the same way as the standard script
python benchmark_throughput_0.7.2_torchprof.py --num-prompts 1  --input-len 2000  --output-len 1 --model  /models/Llama-2-7b-hf  --trust-remote-code  --enforce-eager --dtype float16 > 7b-prefill-2000-test.log 2>&1

```

The profiling information is printed, and the saved JSON file is named:

`{args.num_prompts}-{args.input_len}-{args.output_len}-{args.tensor_parallel_size}_dcu.json`

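For orientation, the kind of wrapper such a `*_torchprof.py` script relies on looks roughly like the following. This is only a sketch of the stock `torch.profiler` API; `run_generation` is a placeholder for the real benchmark loop, and the actual script's internals may differ:

```python
import torch
from torch.profiler import ProfilerActivity, profile


def profiled_run(args, run_generation):
    # Collect CPU activity plus device activity when one is visible
    # (ROCm builds also report through the CUDA activity type).
    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)

    with profile(activities=activities, record_shapes=True) as prof:
        run_generation()

    # Print a summary table and dump a Chrome-trace JSON using the naming
    # scheme described above; the file opens in chrome://tracing or Perfetto.
    print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
    prof.export_chrome_trace(
        f"{args.num_prompts}-{args.input_len}-{args.output_len}-"
        f"{args.tensor_parallel_size}_dcu.json"
    )
```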

### server_prof

Script: `worker.py`

Replace `/usr/local/lib/python3.10/site-packages/vllm/worker/worker.py` with the provided `worker.py`.

```bash
# Start the server
# loca_path is the absolute path where the profiler JSON traces will be saved
export VLLM_TORCH_PROFILER_DIR=$loca_path 
vllm serve $MODEL_PATH  --trust-remote-code   --dtype $dtype --max-model-len $max_len -tp $tp  --gpu-memory-utilization 0.97

# Send requests
# Add other arguments such as --distributed-executor-backend ray as needed
python  benchmark_servein_0.7.2.py --backend vllm --ignore-eos  --profile --dataset-name random --random-input-len  $input_len --random-output-len  $output_len --model $MODEL_PATH  --num-prompts $num_prompts --endpoint /v1/completions
```
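In upstream vLLM's serving benchmark, `--profile` asks the server to start and stop the torch profiler around the run, which is why `VLLM_TORCH_PROFILER_DIR` must be exported before `vllm serve`. If you need to trigger a capture without the benchmark script, the same can be done by hand; a minimal sketch, assuming upstream vLLM's `/start_profile` and `/stop_profile` routes and the default host/port:

```python
import requests

BASE = "http://localhost:8000"

# Start capturing, send the traffic to be profiled, then stop; the trace is
# written under the directory given by VLLM_TORCH_PROFILER_DIR.
requests.post(f"{BASE}/start_profile")
requests.post(
    f"{BASE}/v1/completions",
    json={"model": "/models/Llama-2-7b-hf", "prompt": "Hello", "max_tokens": 16},  # model = $MODEL_PATH
)
requests.post(f"{BASE}/stop_profile")
```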
# 0.6.2

## 1. Offline inference
Script: `benchmark_throughput_0.6.2.py`

Using the script below avoids repeatedly reloading the model when benchmarking different parameter combinations. `batch`, `prompt_tokens`, and `completion_tokens` can each be passed as a space-separated string; all other arguments are the same as in the standard script.

```bash
export HIP_VISIBLE_DEVICES=1
tp=1
model_path=/llm-models/qwen1.5/Qwen1.5-0.5B-Chat


batch="1 2"
prompt_tokens="16 64"
completion_tokens="128 256"
python benchmark_throughput_0.6.2.py --model ${model_path} --tensor-parallel-size ${tp} --num-prompts ${batch} --input-len ${prompt_tokens} --output-len ${completion_tokens} \
    --dtype float16  --trust-remote-code --max-model-len 32768 --output-json ./test_0.5B-0.6.2.txt

```

With the arguments passed as above, the benchmarked scenarios are:

| bs | input | output |
|----|-------|--------|
| 1  | 16    | 128    |
| 1  | 64    | 256    |
| 2  | 16    | 128    |
| 2  | 64    | 256    |

The results are summarized in the file passed to `--output-json` (here `./test_0.5B-0.6.2.txt`), for example:

```
bs_in_out,elapsed_time,Throughput,total_tokens,output_tokens,ttft_mean,ttft_median,ttft_p99,tpop_mean,tpop_median,tpop_p99,output_token_throughput_mean,output_token_throughput_median,output_token_throughput_p99,inout_token_throughput_mean,inout_token_throughput_median,inout_token_throughput_p99
1_16_128,3.49,0.29,41.26,36.68,0.03801,0.03801,0.03801,0.0269,0.02691,0.02691,37.04,37.04,37.04,41.66,41.66,41.66
1_64_256,7.14,0.14,44.82,35.85,0.0291,0.0291,0.0291,0.0278,0.02776,0.02776,36.01,36.01,36.01,45.01,45.01,45.01
2_16_128,3.62,0.55,79.56,70.72,0.04829,0.04829,0.04893,0.028,0.02801,0.02801,35.51,35.51,35.51,39.94,39.94,39.95
2_64_256,7.31,0.27,87.55,70.04,0.04697,0.04697,0.04764,0.0284,0.02836,0.02836,35.17,35.17,35.18,43.97,43.97,43.97

```

## 2. Server inference
Scripts: `benchmark_servein_0.6.2.py`, `backend_request_func.py`

This approach reduces the gap between the length the server actually generates and the specified length.

```bash
# Test with the provided scripts

# Start the server
vllm serve $MODEL_PATH  --trust-remote-code   --dtype $dtype --max-model-len $max_len -tp $tp  --gpu-memory-utilization 0.97



```

```bash
# Send requests
# Add other arguments such as --distributed-executor-backend ray as needed
# Usage is the same as usual, except that --ignore-eos must be added
python  benchmark_servein_0.6.2.py --backend vllm --ignore-eos  --dataset-name random --random-input-len  $input_len --random-output-len  $output_len --model $MODEL_PATH  --num-prompts $num_prompts --endpoint /v1/completions



```

## 3. prof

### offline_prof
#### hipprof

Scripts: `prof.py`, `benchmark_throughput_0.6.2_hipprof.py`

```bash
# Usage example
# The yellow-highlighted parts are the additions on top of the standard command
SGLANG_PROF_ROCTX=1 hipprof --trace-off python benchmark_throughput_0.6.2_hipprof.py --num-prompts 1  --input-len 2000  --output-len 1 --model  /models/Llama-2-7b-hf  --trust-remote-code  --enforce-eager --dtype float16 > 7b-prefill-2000-test.log 2>&1 
```

#### torchprof

Script: `benchmark_throughput_0.6.2_torchprof.py`

```bash
# Launched the same way as the standard script
python benchmark_throughput_0.6.2_torchprof.py --num-prompts 1  --input-len 2000  --output-len 1 --model  /models/Llama-2-7b-hf  --trust-remote-code  --enforce-eager --dtype float16 > 7b-prefill-2000-test.log 2>&1

```

The profiling information is printed, and the saved JSON file is named:

`{args.num_prompts}-{args.input_len}-{args.output_len}-{args.tensor_parallel_size}_dcu.json`

### server_prof
Script: `worker.py`

Replace `/usr/local/lib/python3.10/site-packages/vllm/worker/worker.py` with the provided `worker.py`.

```bash
# Start the server
# loca_path is the absolute path where the profiler JSON traces will be saved
export VLLM_TORCH_PROFILER_DIR=$loca_path 
vllm serve $MODEL_PATH  --trust-remote-code   --dtype $dtype --max-model-len $max_len -tp $tp  --gpu-memory-utilization 0.97

# Send requests
# Add other arguments such as --distributed-executor-backend ray as needed
python  benchmark_servein_0.6.2.py --backend vllm --ignore-eos  --profile --dataset-name random --random-input-len  $input_len --random-output-len  $output_len --model $MODEL_PATH  --num-prompts $num_prompts --endpoint /v1/completions
```