Unverified Commit 0bd1fa40 authored by lvhan028, committed by GitHub

checkin benchmark on real conversation data (#156)

* checkin benchmark on real conversation data

* change resolution

* update
parent 0cc9d095
@@ -35,17 +35,16 @@ LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by
 ## Performance
-As shown in the figure below, we have compared the token generation speed among facebookresearch/llama, HuggingFace Transformers, and DeepSpeed on the 7B model.
+**Case I**: output token throughput with fixed input token and output token number (1, 2048)
+Target Device: NVIDIA A100(80G)
+**Case II**: request throughput with real conversation data
-Metrics: Throughput (token/s)
-Test Setting: LLaMA-7B, NVIDIA A100(80G)
-Test Data: The number of input tokens is 1, and the number of generated tokens is 2048
+The output token throughput of TurboMind exceeds 2000 tokens/s, which is about 5% - 15% higher than DeepSpeed overall and outperforms huggingface transformers by up to 2.3x.
+And the request throughput of TurboMind is 30% higher than vLLM.
-The throughput of TurboMind exceeds 2000 tokens/s, which is about 5% - 15% higher than DeepSpeed overall and outperforms huggingface transformers by up to 2.3x
-![benchmark](https://user-images.githubusercontent.com/12756472/251422522-e94a3db9-eb16-432a-8d8c-078945e7b99a.png)
+![benchmark](https://github.com/InternLM/lmdeploy/assets/4560679/7775c518-608e-4e5b-be73-7645a444e774)
## Quick Start
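For reference, the Case I metric quoted above can be measured with a timing loop like the minimal sketch below. It assumes a generic `generate(prompt_token_ids, max_new_tokens)` callable that returns the generated token ids; it illustrates the metric only and is not the benchmark script this commit checks in.

```python
import time

def output_token_throughput(generate, prompt_len=1, gen_len=2048, warmup=1, rounds=3):
    """Case I: generated tokens per second at a fixed (input, output) shape."""
    prompt = [0] * prompt_len          # dummy token ids; a real run would tokenize text
    for _ in range(warmup):            # untimed warmup so one-off startup cost is excluded
        generate(prompt, gen_len)
    start = time.perf_counter()
    total = 0
    for _ in range(rounds):
        total += len(generate(prompt, gen_len))
    return total / (time.perf_counter() - start)   # token/s
```

With `(prompt_len, gen_len) = (1, 2048)` this matches the (1, 2048) setting named in the diff.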
@@ -36,17 +36,16 @@ LMDeploy is jointly developed by [MMDeploy](https://github.com/open-mmlab/mmdeploy) and [MMRazor](ht
 ## Performance
-As shown in the figure below, we compared the token generation speed of facebookresearch/llama, HuggingFace Transformers, and DeepSpeed on the 7B model.
+**Case I**: fixed numbers of input and output tokens (1, 2048), measuring output token throughput
+Target device: NVIDIA A100(80G)
+**Case II**: real conversation data, measuring request throughput
-Test metric: throughput (token/s)
-Test setting: LLaMA-7B, NVIDIA A100(80G)
-Test data: 1 input token, 2048 generated tokens
+TurboMind's output token throughput exceeds 2000 token/s, about 5% - 15% higher than DeepSpeed overall and 2.3x higher than huggingface transformers.
+On the request throughput metric, TurboMind is 30% more efficient than vLLM.
-TurboMind's throughput exceeds 2000 token/s, about 5% - 15% higher than DeepSpeed overall and 2.3x higher than huggingface transformers
-![benchmark](https://user-images.githubusercontent.com/12756472/251422522-e94a3db9-eb16-432a-8d8c-078945e7b99a.png)
+![benchmark](https://github.com/InternLM/lmdeploy/assets/4560679/7775c518-608e-4e5b-be73-7645a444e774)
## Quick Start
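Similarly, the Case II metric (request throughput on real conversation data) amounts to replaying recorded prompts against the server and counting completed requests per second. The sketch below is a hypothetical illustration: the ShareGPT-style JSON layout (`conversations` / `from` / `value` keys) and the `send_request` stub are assumptions, not the data format or client this commit actually uses.

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor

def load_prompts(path):
    """Collect the human turns from a ShareGPT-style conversation dump."""
    with open(path) as f:
        data = json.load(f)
    return [turn["value"]
            for conv in data
            for turn in conv.get("conversations", [])
            if turn.get("from") == "human"]

def send_request(prompt):
    """Stand-in for one call to the serving endpoint; returns the number of
    generated tokens. Replace the sleep with a real client call."""
    time.sleep(0.01)                   # simulated decode latency
    return 128                         # simulated generated-token count

def request_throughput(prompts, concurrency=64):
    """Case II: completed requests per second while replaying conversations."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        counts = list(pool.map(send_request, prompts))
    elapsed = time.perf_counter() - start
    print(f"{sum(counts) / elapsed:.1f} token/s")  # secondary token-rate readout
    return len(prompts) / elapsed      # req/s
```

The concurrency level matters here: request throughput rewards engines that batch many in-flight requests, which is what the 30%-over-vLLM claim above is measuring.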