Unverified commit 0bd1fa40, authored by lvhan028, committed by GitHub

checkin benchmark on real conversation data (#156)

* checkin benchmark on real conversation data

* change resolution

* update
parent 0cc9d095
@@ -35,17 +35,16 @@ LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by
 ## Performance
-As shown in the figure below, we have compared the token generation speed among facebookresearch/llama, HuggingFace Transformers, and DeepSpeed on the 7B model.
-Target Device: NVIDIA A100(80G)
-Metrics: Throughput (token/s)
-Test Data: The number of input tokens is 1, and the number of generated tokens is 2048
-The throughput of TurboMind exceeds 2000 tokens/s, which is about 5% - 15% higher than DeepSpeed overall and outperforms huggingface transformers by up to 2.3x
-![benchmark](https://user-images.githubusercontent.com/12756472/251422522-e94a3db9-eb16-432a-8d8c-078945e7b99a.png)
+**Case I**: output token throughput with fixed input token and output token numbers (1, 2048)
+**Case II**: request throughput with real conversation data
+Test Setting: LLaMA-7B, NVIDIA A100(80G)
+The output token throughput of TurboMind exceeds 2000 tokens/s, which is about 5% - 15% higher than DeepSpeed overall and outperforms huggingface transformers by up to 2.3x.
+And the request throughput of TurboMind is 30% higher than vLLM.
+![benchmark](https://github.com/InternLM/lmdeploy/assets/4560679/7775c518-608e-4e5b-be73-7645a444e774)
 ## Quick Start
......
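The two cases divide different quantities by wall-clock time: Case I counts generated tokens, Case II counts completed requests. Below is a minimal sketch of how such metrics can be computed. It is not the benchmark script this commit checks in; `generate` is a hypothetical stand-in for the engine under test (TurboMind, vLLM, etc.), and all parameter names are illustrative.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# `generate(prompt, max_new_tokens)` is a hypothetical stand-in for the
# serving engine being benchmarked; assume it blocks until generation
# finishes and returns the number of tokens it produced.

def output_token_throughput(generate, num_runs=10, max_new_tokens=2048):
    """Case I: tokens/s with a fixed 1-token prompt and 2048 output tokens."""
    start = time.perf_counter()
    total_tokens = sum(
        generate("hi", max_new_tokens=max_new_tokens) for _ in range(num_runs)
    )
    return total_tokens / (time.perf_counter() - start)

def request_throughput(generate, prompts, concurrency=32):
    """Case II: completed requests/s over real conversation prompts."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # Fire requests concurrently, as a serving benchmark would.
        list(pool.map(lambda p: generate(p, max_new_tokens=512), prompts))
    return len(prompts) / (time.perf_counter() - start)
```

For Case II the prompts would come from a real conversation dataset, and the concurrency level materially affects the measured number; the commit's own benchmark scripts (not shown in this diff) are the authoritative implementation.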
@@ -36,17 +36,16 @@ LMDeploy is jointly developed by [MMDeploy](https://github.com/open-mmlab/mmdeploy) and [MMRazor](ht…
 ## Performance
-As shown in the figure below, we compared the token generation speed of facebookresearch/llama, HuggingFace Transformers, and DeepSpeed on the 7B model.
-Test device: NVIDIA A100(80G)
-Test metric: throughput (token/s)
-Test data: 1 input token, 2048 generated tokens
-The throughput of TurboMind exceeds 2000 token/s, about 5% - 15% higher than DeepSpeed overall and 2.3x higher than huggingface transformers
-![benchmark](https://user-images.githubusercontent.com/12756472/251422522-e94a3db9-eb16-432a-8d8c-078945e7b99a.png)
+**Case I**: fixed numbers of input and output tokens (1, 2048), measuring output token throughput
+**Case II**: real conversation data, measuring request throughput
+Test setting: LLaMA-7B, NVIDIA A100(80G)
+The output token throughput of TurboMind exceeds 2000 token/s, about 5% - 15% higher than DeepSpeed overall and 2.3x higher than huggingface transformers
+On the request throughput metric, TurboMind is 30% more efficient than vLLM
+![benchmark](https://github.com/InternLM/lmdeploy/assets/4560679/7775c518-608e-4e5b-be73-7645a444e774)
 ## Quick Start
......