No commit message

ea58ee75 · laibao · a494d51f · ea58ee75
Commit ea58ee75 authored Oct 16, 2024 by laibao

Showing with 93 additions and 1 deletion

README.md README.md +93 -1

--- a/README.md
+++ b/README.md
@@ -120,7 +120,99 @@ python examples/llava_example.py
    output:               The image features a close-up view of a stop sign on a city street
-Accuracy
+```bash
+python benchmarks/benchmark_throughput.py --num-prompts 1 --input-len 32 --output-len 128 --model Qwen/Qwen1.5-7B-Chat -tp 1 --trust-remote-code --enforce-eager --dtype float16
+```
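+Here `--num-prompts` is the number of prompts (the batch size), `--input-len` is the input sequence length, `--output-len` is the number of output tokens, `--model` is the model path, `-tp` is the number of accelerator cards used, and `--dtype float16` sets the inference data type; if the model weights are stored in bfloat16, the dtype must be changed to float16 for inference. Setting `--output-len 1` measures first-token latency. Add `-q gptq` to run inference with a GPTQ-quantized model.
+2. Using a dataset
+Download the dataset: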
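When sweeping batch sizes or sequence lengths, the flags described above can be assembled programmatically. The following is a minimal sketch and not part of the repository: the helper name `build_throughput_cmd` and its default values are hypothetical, and it only reuses flags that already appear in the command above.

```python
import shlex


def build_throughput_cmd(model, num_prompts=1, input_len=32, output_len=128,
                         tp=1, dtype="float16"):
    """Return the argv list for benchmarks/benchmark_throughput.py.

    Hypothetical helper: it only reuses flags shown in the README command.
    """
    return [
        "python", "benchmarks/benchmark_throughput.py",
        "--num-prompts", str(num_prompts),
        "--input-len", str(input_len),
        "--output-len", str(output_len),
        "--model", model,
        "-tp", str(tp),
        "--trust-remote-code",
        "--enforce-eager",
        "--dtype", dtype,
    ]


# Setting output_len=1 gives the first-token latency configuration.
first_token_cmd = build_throughput_cmd("Qwen/Qwen1.5-7B-Chat", output_len=1)
print(shlex.join(first_token_cmd))
```

The returned list can be passed directly to `subprocess.run` from the repository root.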
+```bash
+wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
+```
+```bash
+python benchmarks/benchmark_throughput.py --num-prompts 1 --model Qwen/Qwen1.5-7B-Chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json -tp 1 --trust-remote-code --enforce-eager --dtype float16
+```
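+Here `--num-prompts` is the number of prompts (the batch size), `--model` is the model path, `--dataset` is the dataset file to use, `-tp` is the number of accelerator cards used, and `--dtype float16` sets the inference data type; if the model weights are stored in bfloat16, the dtype must be changed to float16 for inference. Add `-q gptq` to run inference with a GPTQ-quantized model.
+### API server inference performance test
+1. Start the server: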
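To sanity-check the downloaded file before benchmarking, it can help to inspect the prompts it contains. The sketch below assumes the ShareGPT file is a JSON array of records, each with a `conversations` list whose turns store their text under `value`. The function name `load_first_turn_pairs` is hypothetical, and the benchmark script's own preprocessing may differ.

```python
import json


def load_first_turn_pairs(path):
    """Load (prompt, reply) pairs from a ShareGPT-style JSON file.

    Assumes each record has a "conversations" list whose items carry the
    text under "value"; records with fewer than two turns are skipped.
    """
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    pairs = []
    for record in records:
        turns = record.get("conversations", [])
        if len(turns) >= 2:
            # Keep only the first exchange of each conversation.
            pairs.append((turns[0]["value"], turns[1]["value"]))
    return pairs
```

For example, `len(load_first_turn_pairs("ShareGPT_V3_unfiltered_cleaned_split.json"))` reports how many usable conversations the file holds.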
+```bash
+python -m vllm.entrypoints.openai.api_server  --model Qwen/Qwen1.5-7B-Chat  --dtype float16 --enforce-eager -tp 1 
+```
+2. Start the client:
+```bash
+python benchmarks/benchmark_serving.py --model Qwen/Qwen1.5-7B-Chat --dataset ShareGPT_V3_unfiltered_cleaned_split.json  --num-prompts 1 --trust-remote-code
+```
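+The parameters are the same as in the dataset-based offline batch inference performance test; for details, see [benchmarks/benchmark_serving.py](benchmarks/benchmark_serving.py)
+### OpenAI-compatible server
+Start the server: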
+```bash
+python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen1.5-7B-Chat --enforce-eager --dtype float16 --trust-remote-code
+```
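+Here `--model` is the path of the model to load and `--dtype` is the data type (float16). By default the chat template predefined in the tokenizer is used; `--chat-template` supplies a new template that overrides the default. Add `-q gptq` to run inference with a GPTQ-quantized model, or `-q awq` to run inference with an AWQ-quantized model.
+List the available models: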
+```bash
+curl http://localhost:8000/v1/models
+```
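The model ids returned by this endpoint are the values that later requests must pass in their `model` field. As a minimal sketch, assuming the response follows the OpenAI list format (a `data` array of objects with an `id` key), the ids can be extracted as shown below. The function names are hypothetical.

```python
import json
from urllib.request import urlopen


def model_ids(models_response):
    """Extract model ids from a parsed /v1/models response body."""
    return [entry["id"] for entry in models_response.get("data", [])]


def fetch_model_ids(base_url="http://localhost:8000"):
    """Query a running server; requires the api_server to be up."""
    with urlopen(f"{base_url}/v1/models") as resp:
        return model_ids(json.load(resp))
```

Calling `fetch_model_ids()` only works while the server started above is listening on port 8000.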
+### Using the OpenAI Completions API with vLLM
+```bash
+curl http://localhost:8000/v1/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "Qwen/Qwen1.5-7B-Chat",
+        "prompt": "What is deep learning?",
+        "max_tokens": 7,
+        "temperature": 0
+    }'
+```
+Alternatively, use [examples/openai_completion_client.py](examples/openai_completion_client.py)
+### Using the OpenAI Chat API with vLLM
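The same completions request can be issued from Python with only the standard library. This is a sketch rather than the repository's client script: the helper names are hypothetical, and the default model name is assumed to match the `--model` value the server was started with.

```python
import json
from urllib.request import Request, urlopen


def build_completion_request(prompt, model="Qwen/Qwen1.5-7B-Chat",
                             base_url="http://localhost:8000",
                             max_tokens=7, temperature=0):
    """Build a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text (server must be up)."""
    with urlopen(build_completion_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

With the server running, `complete("What is deep learning?")` returns the text of the first choice.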
+```bash
+curl http://localhost:8000/v1/chat/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "Qwen/Qwen1.5-7B-Chat",
+        "messages": [
+            {"role": "system", "content": "What is deep learning?"},
+            {"role": "user", "content": "What is deep learning?"}
+        ]
+    }'
+```
+Alternatively, use [examples/openai_chatcompletion_client.py](examples/openai_chatcompletion_client.py)
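For chat requests, the two parts that usually need custom code are building the `messages` list and pulling the assistant reply out of the response. The sketch below assumes the response follows the OpenAI chat format, where the reply is stored under `choices[0].message.content`. Both function names are hypothetical.

```python
def chat_payload(user_text, system_text=None, model="Qwen/Qwen1.5-7B-Chat"):
    """Build the JSON body for /v1/chat/completions as a plain dict."""
    messages = []
    if system_text:
        # The system message is optional and always goes first.
        messages.append({"role": "system", "content": system_text})
    messages.append({"role": "user", "content": user_text})
    return {"model": model, "messages": messages}


def extract_reply(response_body):
    """Return the assistant text from a parsed chat completion response."""
    return response_body["choices"][0]["message"]["content"]
```

The dict returned by `chat_payload` can be serialized with `json.dumps` and sent in place of the `-d` argument in the curl command above.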
+## result
+Accelerator used: 1 × DCU-K100_AI-64G
+```
+Prompt: 'What is deep learning?', Generated text: ' Deep learning is a subset of machine learning that involves the use of neural networks to model and solve complex problems. Neural networks are a network of interconnected nodes or " neurons" that are designed to recognize patterns in data, learn from examples, and make predictions or decisions.\nThe term "deep" in deep learning refers to the use of multiple layers or hidden layers in these neural networks. Each layer processes the input data in a different way, extracting increasingly abstract features as the data passes through.'
+```
+### Accuracy
 None