Commit ea58ee75 authored by laibao

No commit message
parent a494d51f
@@ -120,7 +120,99 @@ python examples/llava_example.py
output: The image features a close-up view of a stop sign on a city street
Accuracy
```bash
python benchmarks/benchmark_throughput.py --num-prompts 1 --input-len 32 --output-len 128 --model Qwen/Qwen1.5-7B-Chat -tp 1 --trust-remote-code --enforce-eager --dtype float16
```
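Here `--num-prompts` is the batch size, `--input-len` is the input sequence length, `--output-len` is the number of output tokens, `--model` is the model path, `-tp` is the number of cards used, and `--dtype float16` sets the inference data type; if the model weights are stored in bfloat16, set this to float16 for inference. Setting `--output-len 1` measures first-token latency. `-q gptq` runs inference with a GPTQ-quantized model.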
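To help interpret the reported numbers, the arithmetic behind a throughput figure can be sketched as below. This is an illustrative calculation, not the code in `benchmarks/benchmark_throughput.py`: the function name `throughput` is hypothetical, the elapsed time is a made-up value, and the sketch assumes every prompt has exactly `--input-len` input tokens and produces exactly `--output-len` output tokens.

```python
def throughput(num_prompts: int, input_len: int, output_len: int, elapsed_s: float):
    """Return (requests/s, tokens/s) for a fixed-length benchmark run."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    # Count both prompt tokens and generated tokens for every request.
    total_tokens = num_prompts * (input_len + output_len)
    return num_prompts / elapsed_s, total_tokens / elapsed_s


# Hypothetical run: 1 prompt, 32 input tokens, 128 output tokens, 4.0 s wall time.
requests_per_s, tokens_per_s = throughput(1, 32, 128, elapsed_s=4.0)
print(f"{requests_per_s:.2f} requests/s, {tokens_per_s:.2f} tokens/s")
```

With `--output-len 1` the same elapsed time is dominated by prompt processing, which is why that setting is used as a first-token latency measurement.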
2. Using a dataset
Download the dataset:
```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```
```bash
python benchmarks/benchmark_throughput.py --num-prompts 1 --model Qwen/Qwen1.5-7B-Chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json -tp 1 --trust-remote-code --enforce-eager --dtype float16
```
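Here `--num-prompts` is the batch size, `--model` is the model path, `--dataset` is the dataset to use, `-tp` is the number of cards used, and `--dtype float16` sets the inference data type; if the model weights are stored in bfloat16, set this to float16 for inference. `-q gptq` runs inference with a GPTQ-quantized model.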
### API service inference performance test
1. Start the server:
```bash
python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen1.5-7B-Chat --dtype float16 --enforce-eager -tp 1
```
2. Start the client:
```bash
python benchmarks/benchmark_serving.py --model Qwen/Qwen1.5-7B-Chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1 --trust-remote-code
```
The parameters are the same as in the dataset-based offline batch inference performance test; for details, see [benchmarks/benchmark_serving.py](benchmarks/benchmark_serving.py).
### OpenAI-compatible server
Start the server:
```bash
python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen1.5-7B-Chat --enforce-eager --dtype float16 --trust-remote-code
```
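Here `--model` is the path of the model to load and `--dtype` is the data type (float16). By default the chat template predefined in the tokenizer is used; `--chat-template` supplies a new template that overrides the default. `-q gptq` runs inference with a GPTQ-quantized model, and `-q awq` runs inference with an AWQ-quantized model.
List the available models: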
```bash
curl http://localhost:8000/v1/models
```
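The same query can be issued from Python using only the standard library. This is a minimal sketch, assuming the server is reachable at `http://localhost:8000` and returns the OpenAI-style list format (`{"object": "list", "data": [...]}`); the helper names `parse_model_ids` and `list_models` are hypothetical, and the sample body is hand-written for an offline check.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # address used by the curl command above


def parse_model_ids(body: str) -> list:
    """Extract the model ids from a /v1/models JSON response body."""
    payload = json.loads(body)
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url: str = BASE_URL) -> list:
    """Query a running server; requires the api_server to be up."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=10) as resp:
        return parse_model_ids(resp.read().decode("utf-8"))


# Offline check with a hand-written body in the OpenAI list format.
sample = '{"object": "list", "data": [{"id": "Qwen/Qwen1.5-7B-Chat", "object": "model"}]}'
print(parse_model_ids(sample))
```

Once the server is running, `list_models()` returns the ids to use in the `model` field of later requests.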
### Using the OpenAI Completions API with vLLM
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen1.5-7B-Chat",
"prompt": "What is deep learning?",
"max_tokens": 7,
"temperature": 0
}'
```
Alternatively, use [examples/openai_completion_client.py](examples/openai_completion_client.py).
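For readers who prefer not to depend on a client library, the curl request can be reproduced with the Python standard library. This is a sketch under stated assumptions, not the contents of `examples/openai_completion_client.py`: the helper names `build_completion_request` and `complete` are hypothetical, the server is assumed to listen on `localhost:8000`, and the `model` value is assumed to match the name passed to `--model` when the server was started.

```python
import json
import urllib.request


def build_completion_request(prompt, model="Qwen/Qwen1.5-7B-Chat", max_tokens=7,
                             temperature=0, base_url="http://localhost:8000/v1"):
    """Build a POST request equivalent to the curl call to /v1/completions."""
    body = json.dumps({
        "model": model,  # assumed to match the served model name
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request to a running server and return the generated text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as resp:
        return json.loads(resp.read())["choices"][0]["text"]


# Building the request needs no server, so it can be inspected offline.
req = build_completion_request("What is deep learning?")
print(req.get_method(), req.full_url)
```

Separating request construction from sending keeps the payload easy to inspect before any network traffic occurs; `complete("What is deep learning?")` performs the actual call.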
### Using the OpenAI Chat API with vLLM
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen1.5-7B-Chat",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is deep learning?"}
]
}'
```
Alternatively, use [examples/openai_chatcompletion_client.py](examples/openai_chatcompletion_client.py).
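The chat endpoint differs from the completions endpoint mainly in its payload (`messages` instead of `prompt`) and in where the reply sits in the response. The sketch below illustrates both; it is not the contents of `examples/openai_chatcompletion_client.py`. The helper names `build_chat_payload` and `extract_reply` are hypothetical, and the sample response is hand-written in the OpenAI chat completion format rather than captured from a real run.

```python
import json


def build_chat_payload(user_message, system_prompt=None,
                       model="Qwen/Qwen1.5-7B-Chat"):
    """Assemble the JSON body for POST /v1/chat/completions."""
    messages = []
    if system_prompt:
        # The system message is optional and always comes first.
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_message})
    return {"model": model, "messages": messages}


def extract_reply(response_body: str) -> str:
    """Return the assistant text from a chat completion response body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


payload = build_chat_payload("What is deep learning?")
print(json.dumps(payload, indent=2))

# Hand-written response in the OpenAI chat format, for an offline check only.
sample_response = json.dumps({
    "choices": [{"index": 0,
                 "message": {"role": "assistant", "content": "(reply text)"}}]
})
print(extract_reply(sample_response))
```

The payload can be sent with any HTTP client by posting it as JSON to `http://localhost:8000/v1/chat/completions`.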
## Result
Accelerator used: 1 × DCU-K100_AI-64G
```
Prompt: 'What is deep learning?', Generated text: ' Deep learning is a subset of machine learning that involves the use of neural networks to model and solve complex problems. Neural networks are a network of interconnected nodes or " neurons" that are designed to recognize patterns in data, learn from examples, and make predictions or decisions.\nThe term "deep" in deep learning refers to the use of multiple layers or hidden layers in these neural networks. Each layer processes the input data in a different way, extracting increasingly abstract features as the data passes through.'
```
### Accuracy
......