Unverified Commit 0bd1fa40 authored by lvhan028, committed by GitHub

checkin benchmark on real conversation data (#156)

* checkin benchmark on real conversation data

* change resolution

* update
parent 0cc9d095
@@ -35,17 +35,16 @@ LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by
 ## Performance
-As shown in the figure below, we have compared the token generation speed among facebookresearch/llama, HuggingFace Transformers, and DeepSpeed on the 7B model.
+**Case I**: output token throughput with fixed input token and output token number (1, 2048)
+Target Device: NVIDIA A100(80G)
+**Case II**: request throughput with real conversation data
-Metrics: Throughput (token/s)
-Test Setting: LLaMA-7B, NVIDIA A100(80G)
-Test Data: The number of input tokens is 1, and the number of generated tokens is 2048
+The output token throughput of TurboMind exceeds 2000 tokens/s, which is about 5% - 15% higher than DeepSpeed overall and outperforms huggingface transformers by up to 2.3x.
+And the request throughput of TurboMind is 30% higher than vLLM.
-The throughput of TurboMind exceeds 2000 tokens/s, which is about 5% - 15% higher than DeepSpeed overall and outperforms huggingface transformers by up to 2.3x
-![benchmark](https://user-images.githubusercontent.com/12756472/251422522-e94a3db9-eb16-432a-8d8c-078945e7b99a.png)
+![benchmark](https://github.com/InternLM/lmdeploy/assets/4560679/7775c518-608e-4e5b-be73-7645a444e774)
## Quick Start
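For reference, the Case I metric quoted above can be measured with a timing loop like the minimal sketch below. It assumes a generic `generate(prompt_token_ids, max_new_tokens)` callable that returns the generated token ids; it illustrates the metric only and is not the benchmark script this commit checks in.

```python
import time

def output_token_throughput(generate, prompt_len=1, gen_len=2048, warmup=1, rounds=3):
    """Case I: generated tokens per second at a fixed (input, output) shape."""
    prompt = [0] * prompt_len          # dummy token ids; a real run would tokenize text
    for _ in range(warmup):            # untimed warmup so one-off startup cost is excluded
        generate(prompt, gen_len)
    start = time.perf_counter()
    total = 0
    for _ in range(rounds):
        total += len(generate(prompt, gen_len))
    return total / (time.perf_counter() - start)   # token/s
```

With `(prompt_len, gen_len) = (1, 2048)` this matches the (1, 2048) setting named in the diff.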
@@ -36,17 +36,16 @@ LMDeploy is jointly developed by [MMDeploy](https://github.com/open-mmlab/mmdeploy) and [MMRazor](ht
 ## Performance
-As shown in the figure below, we compared the token generation speed of facebookresearch/llama, HuggingFace Transformers, and DeepSpeed on the 7B model.
+**Case I**: fixed numbers of input and output tokens (1, 2048), measuring output token throughput
+Target device: NVIDIA A100(80G)
+**Case II**: real conversation data, measuring request throughput
-Test metric: throughput (token/s)
-Test setting: LLaMA-7B, NVIDIA A100(80G)
-Test data: 1 input token, 2048 generated tokens
+TurboMind's output token throughput exceeds 2000 token/s, about 5% - 15% higher than DeepSpeed overall and 2.3x higher than huggingface transformers.
+On the request throughput metric, TurboMind is 30% more efficient than vLLM.
-TurboMind's throughput exceeds 2000 token/s, about 5% - 15% higher than DeepSpeed overall and 2.3x higher than huggingface transformers
-![benchmark](https://user-images.githubusercontent.com/12756472/251422522-e94a3db9-eb16-432a-8d8c-078945e7b99a.png)
+![benchmark](https://github.com/InternLM/lmdeploy/assets/4560679/7775c518-608e-4e5b-be73-7645a444e774)
## Quick Start
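Similarly, the Case II metric (request throughput on real conversation data) amounts to replaying recorded prompts against the server and counting completed requests per second. The sketch below is a hypothetical illustration: the ShareGPT-style JSON layout (`conversations` / `from` / `value` keys) and the `send_request` stub are assumptions, not the data format or client this commit actually uses.

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor

def load_prompts(path):
    """Collect the human turns from a ShareGPT-style conversation dump."""
    with open(path) as f:
        data = json.load(f)
    return [turn["value"]
            for conv in data
            for turn in conv.get("conversations", [])
            if turn.get("from") == "human"]

def send_request(prompt):
    """Stand-in for one call to the serving endpoint; returns the number of
    generated tokens. Replace the sleep with a real client call."""
    time.sleep(0.01)                   # simulated decode latency
    return 128                         # simulated generated-token count

def request_throughput(prompts, concurrency=64):
    """Case II: completed requests per second while replaying conversations."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        counts = list(pool.map(send_request, prompts))
    elapsed = time.perf_counter() - start
    print(f"{sum(counts) / elapsed:.1f} token/s")  # secondary token-rate readout
    return len(prompts) / elapsed      # req/s
```

The concurrency level matters here: request throughput rewards engines that batch many in-flight requests, which is what the 30%-over-vLLM claim above is measuring.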