"git@developer.sourcefind.cn:OpenDAS/torch-sparce.git" did not exist on "1cb25232bdefcaad8ad88c540442981e8d8cab0e"
Unverified Commit 3e7b6bfd authored by lvhan028's avatar lvhan028 Committed by GitHub
Browse files

improve readme (#52)

* add performance

* use png

* update

* update

* update

* update

* update
parent adfd81d3
<div align="center">
<img src="resources/lmdeploy-logo.png" width="450"/>
[![docs](https://img.shields.io/badge/docs-latest-blue)](https://lmdeploy.readthedocs.io/en/latest/)
[![codecov](https://codecov.io/gh/InternLM/lmdeploy/branch/main/graph/badge.svg)](https://codecov.io/gh/InternLM/lmdeploy)
[![license](https://img.shields.io/github/license/InternLM/lmdeploy.svg)](https://github.com/InternLM/lmdeploy/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/InternLM/lmdeploy)](https://github.com/InternLM/lmdeploy/issues)
[![open issues](https://img.shields.io/github/issues-raw/InternLM/lmdeploy)](https://github.com/InternLM/lmdeploy/issues)
English | [简体中文](README_zh-CN.md)
</div>
LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams. It has the following core features:

- **Efficient Inference Engine TurboMind**: Based on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), we have implemented the efficient inference engine TurboMind, which supports inference of InternLM, LLaMA, Vicuna and other models on NVIDIA GPUs.
- **Interactive Inference Mode**: By caching the k/v of attention during multi-round dialogue processes, it remembers dialogue history, thus avoiding repetitive processing of historical sessions.
<div align="center">
<img src="https://github.com/NVIDIA/FasterTransformer/blob/main/docs/images/gpt/gpt_interactive_generation.2.png?raw=true"/>
</div>
- **Multi-GPU Model Deployment and Quantization**: We provide comprehensive model deployment and quantization support, validated at different scales.
- **Persistent Batch Inference**: Further optimization of model execution efficiency.
![PersistentBatchInference](https://github.com/open-mmlab/lmdeploy/assets/25839884/8f8b57b8-42af-4b71-ad74-e75f39b10694)
## Performance
As shown in the figure below, we compared the token generation speed of TurboMind with facebookresearch/llama, HuggingFace Transformers, and DeepSpeed on the 7B model.

- Target device: NVIDIA A100 (80G)
- Metric: throughput (token/s)
- Test data: 1 input token, 2048 generated tokens

TurboMind's throughput exceeds 2000 token/s, roughly 5% - 15% higher than DeepSpeed overall and up to 2.3x that of HuggingFace Transformers.
![benchmark](https://github.com/InternLM/lmdeploy/assets/4560679/1aa64d01-621c-4b53-8e48-e66bc4636b3b)
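The throughput metric above is simply generated tokens divided by wall-clock time; the sketch below illustrates how such a number could be reproduced, with the actual generation command left as a placeholder since the benchmark script is not part of this README.

```shell
# Illustrative only: throughput = generated tokens / elapsed seconds.
tokens=2048
start=$(date +%s)
# ... run a single generation producing ${tokens} tokens here ...
end=$(date +%s)
elapsed=$(( end - start ))
echo "generated ${tokens} tokens in ${elapsed}s => $(( tokens / (elapsed > 0 ? elapsed : 1) )) token/s"
```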
## Quick Start
### Installation
Below are quick steps for installation:
```shell
conda create -n lmdeploy python=3.10
conda activate lmdeploy
git clone https://github.com/InternLM/lmdeploy.git
cd lmdeploy
pip install -e .
```
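As a quick sanity check of the editable install (a minimal sketch, assuming the steps above completed without errors), importing the package should succeed:

```shell
# The import should work right after `pip install -e .`
python3 -c "import lmdeploy"
```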
### Build

Pull the docker image `openmmlab/lmdeploy:latest` and build the lmdeploy libs inside the launched container:

```shell
mkdir build && cd build
../generate.sh
make -j$(nproc) && make install
```

### Deploy InternLM

#### Get InternLM model

```shell
# 1. Download InternLM model

# 2. Convert InternLM model to turbomind's format, which will be stored in "./workspace" by default
python3 -m lmdeploy.serve.turbomind.deploy internlm-7b /path/to/internlm-7b hf
```
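For concreteness, here is a hedged usage example of the conversion step; the local directory `./internlm-7b` is an assumption, so point it at wherever the InternLM weights were actually downloaded.

```shell
# Hypothetical local path; substitute your own download location.
python3 -m lmdeploy.serve.turbomind.deploy internlm-7b ./internlm-7b hf
ls workspace   # the converted turbomind model is written here by default
```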
#### Inference by TurboMind
```shell
docker run --rm -v $(pwd)/workspace:/workspace -it openmmlab/lmdeploy:latest \
    python3 -m lmdeploy.turbomind.chat internlm /workspace
```
```{note}
When inferring with FP16 precision, the InternLM-7B model requires at least 22.7 GB of GPU memory on TurboMind. NVIDIA GPUs such as the 3090, V100, and A100 are recommended.
```
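Before launching, it can help to confirm that a GPU with enough free memory is available; the check below only assumes the standard NVIDIA driver utilities are installed.

```shell
# Report total and free memory for each visible GPU.
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv
```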
#### Serving
Launch the inference server by:
```shell
bash workspace/service_docker_up.sh
```
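To verify that the serving container actually started, a generic Docker check can be used; the image-name filter below is an assumption based on the image pulled earlier, and the container layout may differ depending on how `service_docker_up.sh` is configured.

```shell
# List running containers created from the lmdeploy image (illustrative check).
docker ps --filter "ancestor=openmmlab/lmdeploy:latest"
```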
## Inference with Command Line Interface
Then you can communicate with the inference server from the command line:
```shell
python3 -m lmdeploy.serve.client {server_ip_address}:33337 internlm
```
## Inference with Web UI
Or chat through the web UI:
```shell
python3 -m lmdeploy.app {server_ip_address}:33337 internlm
```
![](https://github.com/open-mmlab/lmdeploy/assets/41138331/f4352172-d8b1-49aa-b658-50ce72b896a5)
## User Guide
For the deployment of other supported models, such as LLaMA and Vicuna, please refer to the guide [here](docs/en/serving.md).
## Quantization
<div align="center">
<img src="resources/lmdeploy-logo.png" width="450"/>
[![docs](https://img.shields.io/badge/docs-latest-blue)](https://lmdeploy.readthedocs.io/en/latest/)
[![codecov](https://codecov.io/gh/open-mmlab/lmdeploy/branch/main/graph/badge.svg)](https://codecov.io/gh/open-mmlab/lmdeploy)
[![license](https://img.shields.io/github/license/open-mmlab/lmdeploy.svg)](https://github.com/open-mmlab/lmdeploy/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/open-mmlab/lmdeploy)](https://github.com/open-mmlab/lmdeploy/issues)
[![open issues](https://img.shields.io/github/issues-raw/open-mmlab/lmdeploy)](https://github.com/open-mmlab/lmdeploy/issues)
[English](README.md) | 简体中文
</div>
LMDeploy, jointly developed by the [MMDeploy](https://github.com/open-mmlab/mmdeploy) and [MMRazor](https://github.com/open-mmlab/mmrazor) teams, is a full toolkit for compressing, deploying, and serving LLMs.
This powerful toolbox provides the following core features:

- **Efficient Inference Engine TurboMind**: Based on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), we have implemented the efficient inference engine TurboMind, which supports inference of InternLM, LLaMA, Vicuna and other models on NVIDIA GPUs.
- **Interactive Inference Mode**: By caching the attention k/v during multi-round dialogues, the engine remembers dialogue history and avoids reprocessing historical sessions.
<div align="center">
<img src="https://github.com/NVIDIA/FasterTransformer/blob/main/docs/images/gpt/gpt_interactive_generation.2.png?raw=true"/>
</div>
- **Multi-GPU Model Deployment and Quantization**: We provide comprehensive model deployment and quantization support, validated at different scales.
- **Persistent Batch Inference**: Further optimization of model execution efficiency.
![PersistentBatchInference](https://github.com/open-mmlab/lmdeploy/assets/25839884/8f8b57b8-42af-4b71-ad74-e75f39b10694)
## Performance

As shown in the figure below, we compared the token generation speed of TurboMind with facebookresearch/llama, HuggingFace Transformers, and DeepSpeed on the 7B model.

- Target device: NVIDIA A100 (80G)
- Metric: throughput (token/s)
- Test data: 1 input token, 2048 generated tokens

TurboMind's throughput exceeds 2000 token/s, roughly 5% - 15% higher than DeepSpeed overall and up to 2.3x that of HuggingFace Transformers.
![benchmark](https://github.com/InternLM/lmdeploy/assets/4560679/1aa64d01-621c-4b53-8e48-e66bc4636b3b)
## Quick Start

### Installation

```shell
conda create -n lmdeploy python=3.10
conda activate lmdeploy
git clone https://github.com/InternLM/lmdeploy.git
cd lmdeploy
pip install -e .
```
### Deploy InternLM

#### Get the InternLM model

```shell
# 1. Download the InternLM model

# 2. Convert it to the format required by turbomind. The default output path is ./workspace
python3 -m lmdeploy.serve.turbomind.deploy internlm-7b /path/to/internlm-7b hf
```
#### Inference with TurboMind

```shell
docker run --rm -v $(pwd)/workspace:/workspace -it openmmlab/lmdeploy:latest \
    python3 -m lmdeploy.turbomind.chat internlm /workspace
```
```{note}
When inferring the InternLM-7B model with FP16 precision, TurboMind requires at least 22.7 GB of GPU memory. NVIDIA GPUs such as the 3090, V100, and A100 are recommended.
```

#### Launch the inference service

Start the inference server with the following command:
```shell
bash workspace/service_docker_up.sh
```
## Inference via the command line

You can chat with the inference service from the command line:
```shell
python3 -m lmdeploy.serve.client {server_ip_address}:33337 internlm
```
## Inference via the Web UI

You can also chat through a WebUI:
```shell
python3 -m lmdeploy.app {server_ip_address}:33337 internlm
```
![](https://github.com/open-mmlab/lmdeploy/assets/41138331/f4352172-d8b1-49aa-b658-50ce72b896a5)
For how to deploy other models, such as LLaMA and Vicuna, please refer to [this guide](docs/zh_cn/serving.md).
## Quantized Deployment

In fp16 mode, kv_cache int8 quantization can be enabled so that a single GPU can serve more users.
# Serving a model
## Serving [LLaMA](https://github.com/facebookresearch/llama)
Weights for the LLaMA models can be obtained by filling out [this form](https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform).
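For orientation, the official download typically unpacks into per-size folders plus a shared tokenizer file; treat the layout below as an assumption and adjust the `--tokenizer_path` argument in the commands that follow to match your copy.

```shell
# Assumed layout of the downloaded LLaMA release; verify against your own copy.
#   llama/
#   ├── tokenizer.model                  <- value for --tokenizer_path
#   └── 7B/consolidated.00.pth, 7B/params.json
ls llama/7B llama/tokenizer.model
```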
<details open>
<summary><b>7B</b></summary>
```shell
python3 -m lmdeploy.serve.turbomind.deploy llama-7B /path/to/llama-7b llama \
--tokenizer_path /path/to/tokenizer/model
bash workspace/service_docker_up.sh
```
</details>
<details open>
<summary><b>13B</b></summary>
```shell
python3 -m lmdeploy.serve.turbomind.deploy llama-13B /path/to/llama-13b llama \
--tokenizer_path /path/to/tokenizer/model --tp 2
bash workspace/service_docker_up.sh
```
</details>
<details open>
<summary><b>30B</b></summary>
```shell
python3 -m lmdeploy.serve.turbomind.deploy llama-30B /path/to/llama-30b llama \
--tokenizer_path /path/to/tokenizer/model --tp 4
bash workspace/service_docker_up.sh
```
</details>
<details open>
<summary><b>65B</b></summary>
```shell
python3 -m lmdeploy.serve.turbomind.deploy llama-65B /path/to/llama-65b llama \
--tokenizer_path /path/to/tokenizer/model --tp 8
bash workspace/service_docker_up.sh
```
</details>
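The `--tp` value used above sets the tensor-parallel degree, so it should not exceed the number of GPUs visible on the server; a quick way to count them (assuming the NVIDIA driver utilities are installed) is shown below.

```shell
# Count the visible NVIDIA GPUs before choosing --tp.
nvidia-smi --list-gpus | wc -l
```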
### Serving [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/)
<details open>
<summary><b>7B</b></summary>
```shell
python3 -m pip install fschat
python3 -m fastchat.model.apply_delta \
--base-model-path /path/to/llama-7b \
--target-model-path /path/to/vicuna-7b \
--delta-path lmsys/vicuna-7b-delta-v1.1
python3 -m lmdeploy.serve.turbomind.deploy vicuna-7B /path/to/vicuna-7b hf
bash workspace/service_docker_up.sh
```
</details>
<details open>
<summary><b>13B</b></summary>
```shell
python3 -m pip install fschat
python3 -m fastchat.model.apply_delta \
--base-model-path /path/to/llama-13b \
--target-model-path /path/to/vicuna-13b \
--delta-path lmsys/vicuna-13b-delta-v1.1
python3 -m lmdeploy.serve.turbomind.deploy vicuna-13B /path/to/vicuna-13b hf
bash workspace/service_docker_up.sh
```
</details>
# Serving a model

## Serving [LLaMA](https://github.com/facebookresearch/llama)

Please fill out [this form](https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform) to obtain the LLaMA model weights.
<details open>
<summary><b>7B</b></summary>
```shell
python3 -m lmdeploy.serve.turbomind.deploy llama-7B /path/to/llama-7b llama \
--tokenizer_path /path/to/tokenizer/model
bash workspace/service_docker_up.sh
```
</details>
<details open>
<summary><b>13B</b></summary>
```shell
python3 -m lmdeploy.serve.turbomind.deploy llama-13B /path/to/llama-13b llama \
--tokenizer_path /path/to/tokenizer/model --tp 2
bash workspace/service_docker_up.sh
```
</details>
<details open>
<summary><b>30B</b></summary>
```shell
python3 -m lmdeploy.serve.turbomind.deploy llama-30B /path/to/llama-30b llama \
--tokenizer_path /path/to/tokenizer/model --tp 4
bash workspace/service_docker_up.sh
```
</details>
<details open>
<summary><b>65B</b></summary>
```shell
python3 -m lmdeploy.serve.turbomind.deploy llama-65B /path/to/llama-65b llama \
--tokenizer_path /path/to/tokenizer/model --tp 8
bash workspace/service_docker_up.sh
```
</details>
### Serving [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/)
<details open>
<summary><b>7B</b></summary>
```shell
python3 -m pip install fschat
python3 -m fastchat.model.apply_delta \
--base-model-path /path/to/llama-7b \
--target-model-path /path/to/vicuna-7b \
--delta-path lmsys/vicuna-7b-delta-v1.1
python3 -m lmdeploy.serve.turbomind.deploy vicuna-7B /path/to/vicuna-7b hf
bash workspace/service_docker_up.sh
```
</details>
<details open>
<summary><b>13B</b></summary>
```shell
python3 -m pip install fschat
python3 -m fastchat.model.apply_delta \
--base-model-path /path/to/llama-13b \
--target-model-path /path/to/vicuna-13b \
--delta-path lmsys/vicuna-13b-delta-v1.1
python3 -m lmdeploy.serve.turbomind.deploy vicuna-13B /path/to/vicuna-13b hf
bash workspace/service_docker_up.sh
```
</details>