<div align="center">
  <img src="resources/lmdeploy-logo.svg" width="450"/>

[![docs](https://img.shields.io/badge/docs-latest-blue)](https://lmdeploy-zh-cn.readthedocs.io/zh_CN/latest/)
[![badge](https://github.com/InternLM/lmdeploy/workflows/lint/badge.svg)](https://github.com/InternLM/lmdeploy/actions)
[![PyPI](https://img.shields.io/pypi/v/lmdeploy)](https://pypi.org/project/lmdeploy)
[![license](https://img.shields.io/github/license/InternLM/lmdeploy.svg)](https://github.com/InternLM/lmdeploy/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/InternLM/lmdeploy)](https://github.com/InternLM/lmdeploy/issues)
[![open issues](https://img.shields.io/github/issues-raw/InternLM/lmdeploy)](https://github.com/InternLM/lmdeploy/issues)

[English](README.md) | 简体中文

</div>

<p align="center">
    👋 join us on <a href="https://twitter.com/intern_lm" target="_blank">Twitter</a>, <a href="https://discord.gg/xa29JuW87d" target="_blank">Discord</a> and <a href="https://r.vansin.top/?r=internwx" target="_blank">WeChat</a>
</p>

______________________________________________________________________

## News 🎉

- \[2023/08\] TurboMind supports flash-attention2
- \[2023/08\] TurboMind supports Qwen-7B with dynamic NTK-RoPE scaling and dynamic logN scaling
- \[2023/08\] TurboMind supports Windows (tp=1)
- \[2023/08\] TurboMind supports 4-bit inference, 2.4x faster than FP16, the fastest open-source implementation so far 🚀. Check [this guide](./docs/zh_cn/w4a16.md) for deployment instructions
- \[2023/08\] LMDeploy has launched on the [HuggingFace Hub](https://huggingface.co/lmdeploy), providing ready-to-use 4-bit models
- \[2023/08\] LMDeploy supports 4-bit quantization using the [AWQ](https://arxiv.org/abs/2306.00978) algorithm
- \[2023/07\] TurboMind supports the Llama-2 70B model with GQA
- \[2023/07\] TurboMind supports Llama-2 7B/13B models
- \[2023/07\] TurboMind supports tensor parallel inference of InternLM

______________________________________________________________________

## Introduction

LMDeploy, jointly developed by the [MMDeploy](https://github.com/open-mmlab/mmdeploy) and [MMRazor](https://github.com/open-mmlab/mmrazor) teams, is a full-fledged toolkit for compressing, deploying, and serving LLMs.
This powerful toolkit offers the following core features:

- **Efficient Inference Engine (TurboMind)**: Built on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), our TurboMind engine supports inference of models such as InternLM, LLaMA, and Vicuna on NVIDIA GPUs.

- **Interactive Inference Mode**: By caching the attention k/v of multi-round dialogues, the engine remembers dialogue history and avoids reprocessing historical sessions.

- **Multi-GPU Deployment and Quantization**: Comprehensive support for model deployment and quantization, verified at various scales.

- **Persistent Batch Inference**: Further optimizes model execution efficiency.

  ![PersistentBatchInference](https://github.com/InternLM/lmdeploy/assets/67539920/e3876167-0671-44fc-ac52-5a0f9382493e)

## Supported Models

`LMDeploy` supports both the `TurboMind` and `Pytorch` inference backends.

### TurboMind

> **Note**<br />
> W4A16 inference requires NVIDIA GPUs with Ampere architecture or above
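
You can check whether a card qualifies by querying its compute capability (Ampere corresponds to 8.0 and above). A minimal sketch, assuming a recent NVIDIA driver whose `nvidia-smi` supports the `compute_cap` query field:

```shell
# Ampere and newer GPUs report compute capability >= 8.0
# (e.g. A100 = 8.0, RTX 30xx = 8.6)
nvidia-smi --query-gpu=name,compute_cap --format=csv
```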

|  Models  | Model Parallel | FP16 | KV INT8 | W4A16 | W8A8 |
| :------: | :------------: | :--: | :-----: | :---: | :--: |
|  Llama   |      Yes       | Yes  |   Yes   |  Yes  |  No  |
|  Llama2  |      Yes       | Yes  |   Yes   |  Yes  |  No  |
| InternLM |      Yes       | Yes  |   Yes   |  Yes  |  No  |

### Pytorch

|  Models  | Model Parallel | FP16 | KV INT8 | W4A16 | W8A8 |
| :------: | :------------: | :--: | :-----: | :---: | :--: |
|  Llama   |      Yes       | Yes  |   No    |  No   |  No  |
|  Llama2  |      Yes       | Yes  |   No    |  No   |  No  |
| InternLM |      Yes       | Yes  |   No    |  No   |  No  |

## Performance

**Case I**: fixed input and output token counts (1, 2048); measures output token throughput

**Case II**: real conversation data; measures request throughput

Test configuration: LLaMA-7B, NVIDIA A100(80G)

TurboMind's output token throughput exceeds 2000 tokens/s, which is about 5% - 15% higher than DeepSpeed overall and 2.3x the throughput of huggingface transformers.
On request throughput, TurboMind is 30% more efficient than vLLM.

![benchmark](https://github.com/InternLM/lmdeploy/assets/4560679/7775c518-608e-4e5b-be73-7645a444e774)

## Quick Start

### Installation

Install LMDeploy with pip (python 3.8+), or [build from source](./docs/zh_cn/build.md):

```shell
pip install lmdeploy
```

### Deploy InternLM

#### Get the InternLM model

```shell
# 1. Download the InternLM model

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/internlm/internlm-chat-7b /path/to/internlm-chat-7b

# if you want to clone without large files – just their pointers,
# prepend your git clone with the following env var, e.g.
# GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/internlm/internlm-chat-7b /path/to/internlm-chat-7b

# 2. Convert the model to turbomind's format, saved in ./workspace by default
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b

```

#### Inference with TurboMind

```shell
python3 -m lmdeploy.turbomind.chat ./workspace
```

> **Note**<br />
> When running FP16 inference of the InternLM-7B model, turbomind requires at least 15.7G of GPU memory. GPUs such as the 3090, V100, and A100 are recommended.<br />
> Disabling the GPU's ECC can free up about 10% of memory; run `sudo nvidia-smi --ecc-config=0` and reboot for it to take effect.

> **Note**<br />
> Tensor parallelism lets inference run on multiple GPUs. Add `--tp=<num_gpu>` to the `chat` command to enable runtime TP.
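
For example, to run the chat session above across 2 GPUs:

```shell
# runtime tensor parallelism over 2 GPUs
python3 -m lmdeploy.turbomind.chat ./workspace --tp=2
```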

#### Launch the gradio server

```shell
python3 -m lmdeploy.serve.gradio.app ./workspace
```

![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)

#### Serving with the Restful API

Launch the inference server with the following command:

```shell
python3 -m lmdeploy.serve.openai.api_server ./workspace server_ip server_port --instance_num 32 --tp 1
```

You can chat with the inference server from the command line:

```shell
# restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
python -m lmdeploy.serve.openai.api_client restful_api_url
```

Or chat through a WebUI:

```shell
# restful_api_url is what printed in api_server.py, e.g. http://localhost:23333
# server_ip and server_port here are for gradio ui
# example: python -m lmdeploy.serve.gradio.app http://localhost:23333 localhost 6006 --restful_api True
python -m lmdeploy.serve.gradio.app restful_api_url server_ip --restful_api True
```

Refer to [restful_api.md](docs/zh_cn/restful_api.md) for more details.
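
Since the server follows an OpenAI-style protocol, you can also probe it directly over HTTP. A minimal sketch, assuming a `/v1/chat/completions` route as documented in restful_api.md (the exact routes and payload fields may vary between versions):

```shell
# hypothetical probe; see restful_api.md for the authoritative routes and fields
curl http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "internlm-chat-7b", "messages": [{"role": "user", "content": "Hello"}]}'
```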

#### Serving in a Docker container

Launch the inference server with the following command:

```shell
bash workspace/service_docker_up.sh
```

You can chat with the inference server from the command line:

```shell
python3 -m lmdeploy.serve.client {server_ip_address}:33337
```

Or chat through a WebUI:

```shell
python3 -m lmdeploy.serve.gradio.app {server_ip_address}:33337
```

For deploying other models, such as LLaMA, LLaMA-2, Vicuna, and so on, refer to [this guide](docs/zh_cn/serving.md).

### Inference with PyTorch

You have to make sure deepspeed is installed in your environment:

```shell
pip install deepspeed
```

#### Single GPU

```shell
python3 -m lmdeploy.pytorch.chat $NAME_OR_PATH_TO_HF_MODEL \
    --max_new_tokens 64 \
    --temperature 0.8 \
    --top_p 0.95 \
    --seed 0
```
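
For instance, a minimal invocation against the InternLM checkpoint cloned in the Quick Start section, assuming the sampling flags above are optional and fall back to defaults:

```shell
# a sketch: only the model path is assumed to be strictly required
python3 -m lmdeploy.pytorch.chat /path/to/internlm-chat-7b --seed 0
```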

#### Tensor Parallelism with DeepSpeed

```shell
deepspeed --module --num_gpus 2 lmdeploy.pytorch.chat \
    $NAME_OR_PATH_TO_HF_MODEL \
    --max_new_tokens 64 \
    --temperature 0.8 \
    --top_p 0.95 \
    --seed 0
```

## Quantized Deployment

#### Weight INT4 Quantization

LMDeploy uses the [AWQ](https://arxiv.org/abs/2306.00978) algorithm to quantize model weights.

[Click here](./docs/zh_cn/w4a16.md) for the usage and test results of weight int4.

#### KV Cache INT8 Quantization

[Click here](./docs/zh_cn/kv_int8.md) for the usage, implementation formulas, and test results of kv int8.

> **Warning**<br />
> Quantized deployment does not support runtime tensor parallelism. If you need tensor parallelism, configure the tp parameter when running deploy.
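
For instance, a sketch of fixing the tensor-parallel degree at conversion time, assuming the deploy script from the Quick Start accepts a `--tp` flag (check the deployment docs for the exact option name in your version):

```shell
# assumed flag: --tp fixes the tensor-parallel degree when converting the model
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b --tp 2
```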

## Contributing

We appreciate all contributions to improving LMDeploy. Please refer to the [Contribution Guide](.github/CONTRIBUTING.md) for guidance on how to participate.

## Acknowledgements

- [FasterTransformer](https://github.com/NVIDIA/FasterTransformer)
- [llm-awq](https://github.com/mit-han-lab/llm-awq)

## License

This project is released under the [Apache 2.0 license](LICENSE).