<div align="center">
  <img src="resources/lmdeploy-logo.png" width="450"/>

[![docs](https://img.shields.io/badge/docs-latest-blue)](https://lmdeploy.readthedocs.io/en/latest/)
[![codecov](https://codecov.io/gh/open-mmlab/lmdeploy/branch/main/graph/badge.svg)](https://codecov.io/gh/open-mmlab/lmdeploy)
[![license](https://img.shields.io/github/license/open-mmlab/lmdeploy.svg)](https://github.com/open-mmlab/lmdeploy/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/open-mmlab/lmdeploy)](https://github.com/open-mmlab/lmdeploy/issues)
[![open issues](https://img.shields.io/github/issues-raw/open-mmlab/lmdeploy)](https://github.com/open-mmlab/lmdeploy/issues)

[English](README.md) | 简体中文

</div>

<div align="center">
  <a href="https://openmmlab.medium.com/" style="text-decoration:none;">
    <img src="https://user-images.githubusercontent.com/25839884/219255827-67c1a27f-f8c5-46a9-811d-5e57448c61d1.png" width="3%" alt="" /></a>
  <img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
  <a href="https://discord.com/channels/1037617289144569886/1046608014234370059" style="text-decoration:none;">
    <img src="https://user-images.githubusercontent.com/25839884/218347213-c080267f-cbb6-443e-8532-8e1ed9a58ea9.png" width="3%" alt="" /></a>
  <img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
  <a href="https://twitter.com/OpenMMLab" style="text-decoration:none;">
    <img src="https://user-images.githubusercontent.com/25839884/218346637-d30c8a0f-3eba-4699-8131-512fb06d46db.png" width="3%" alt="" /></a>
  <img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
  <a href="https://www.youtube.com/openmmlab" style="text-decoration:none;">
    <img src="https://user-images.githubusercontent.com/25839884/218346691-ceb2116a-465a-40af-8424-9f30d2348ca9.png" width="3%" alt="" /></a>
  <img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
  <a href="https://space.bilibili.com/1293512903" style="text-decoration:none;">
    <img src="https://user-images.githubusercontent.com/25839884/219026751-d7d14cce-a7c9-4e82-9942-8375fca65b99.png" width="3%" alt="" /></a>
  <img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
  <a href="https://www.zhihu.com/people/openmmlab" style="text-decoration:none;">
    <img src="https://user-images.githubusercontent.com/25839884/219026120-ba71e48b-6e94-4bd4-b4e9-b7d175b5e362.png" width="3%" alt="" /></a>
</div>

## Introduction

LMDeploy, developed jointly by the [MMDeploy](https://github.com/open-mmlab/mmdeploy) and [MMRazor](https://github.com/open-mmlab/mmrazor) teams, is a full suite of lightweight compression, deployment, and serving solutions for LLM tasks.
This powerful toolbox provides the following core features:

- **Efficient inference engine TurboMind**: Built on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), the TurboMind inference engine supports inference of LLaMA and its variant models on NVIDIA GPUs.

- **Interactive inference mode**: By caching the attention k/v across multi-turn conversations, the engine remembers the dialogue history and avoids re-processing it (a toy sketch of the idea follows this list).

  <div align="center">
    <img src="https://github.com/NVIDIA/FasterTransformer/blob/main/docs/images/gpt/gpt_interactive_generation.2.png?raw=true"/>
  </div>

- **Multi-GPU deployment and quantization**: Comprehensive support for model deployment and quantization, verified across model scales.

- **Persistent batch inference**: Further improves model execution efficiency.

  ![PersistentBatchInference](https://github.com/open-mmlab/lmdeploy/assets/25839884/8f8b57b8-42af-4b71-ad74-e75f39b10694)
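
The interactive mode above can be pictured with a small Python sketch. This is only an illustration of the caching idea, using made-up names (`Session`, `step`) rather than the actual TurboMind interface:

```python
# Toy model of interactive inference -- NOT the TurboMind API.
# The server keeps one attention k/v cache per session, so each new turn
# only has to encode the tokens added since the previous call.

class Session:
    def __init__(self):
        self.kv_cache = []    # stand-in for the per-layer attention k/v tensors
        self.n_processed = 0  # number of tokens already encoded into the cache

def step(session, token_ids):
    """Encode only the unseen suffix of the conversation."""
    new_tokens = token_ids[session.n_processed:]
    session.kv_cache.extend(new_tokens)  # a real engine appends k/v tensors here
    session.n_processed = len(token_ids)

history = []
session = Session()
for turn in ([1, 2, 3], [4, 5], [6]):  # token ids from three conversation turns
    history += turn
    step(session, history)             # cost grows with len(turn), not len(history)
assert session.n_processed == 6        # every token was encoded exactly once
```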

## Quick Start

### Installation

```shell
conda create -n open-mmlab python=3.8
conda activate open-mmlab
git clone https://github.com/open-mmlab/lmdeploy.git
cd lmdeploy
pip install -e .
```

### Build

Pull the docker image `openmmlab/lmdeploy:latest`, mount the lmdeploy source as a data volume, start the container, and run the following commands inside it:

```shell
mkdir build && cd build
../generate.sh
make -j$(nproc) && make install
```

### Serving [LLaMA](https://github.com/facebookresearch/llama)

Please fill out [this form](https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform) to obtain the LLaMA model weights.

Run the following commands to deploy the LLaMA model on an NVIDIA GPU server:

<details close>
<summary><b>7B</b></summary>

```shell
python3 lmdeploy/serve/turbomind/deploy.py llama-7B /path/to/llama-7b llama \
    --tokenizer_path /path/to/tokenizer/model
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/turbomind
```

</details>

<details close>
<summary><b>13B</b></summary>

```shell
python3 lmdeploy/serve/turbomind/deploy.py llama-13B /path/to/llama-13b llama \
    --tokenizer_path /path/to/tokenizer/model --tp 2
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/turbomind
```

</details>

### Serving [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/)

<details open>
<summary><b>7B</b></summary>

```shell
python3 -m pip install fschat
python3 -m fastchat.model.apply_delta \
  --base-model-path /path/to/llama-7b \
  --target-model-path /path/to/vicuna-7b \
  --delta-path lmsys/vicuna-7b-delta-v1.1

python3 lmdeploy/serve/turbomind/deploy.py vicuna-7B /path/to/vicuna-7b hf
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/turbomind
```

</details>

<details>
<summary><b>13B</b></summary>

```shell
python3 -m pip install fschat
python3 -m fastchat.model.apply_delta \
  --base-model-path /path/to/llama-13b \
  --target-model-path /path/to/vicuna-13b \
  --delta-path lmsys/vicuna-13b-delta-v1.1

python3 lmdeploy/serve/turbomind/deploy.py vicuna-13B /path/to/vicuna-13b hf
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/turbomind
```

</details>

## Inference via the Command Line

```shell
python3 lmdeploy/serve/client.py {server_ip_address}:33337
```

## Inference via a Web Browser

```shell
python3 lmdeploy/app.py {server_ip_address}:33337 {model_name}
```

## Quantized Deployment

In fp16 mode, kv_cache int8 quantization can be enabled so that a single GPU can serve more users.
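
As a rough, illustrative estimate of why this matters, assume LLaMA-7B-like shapes (32 layers, 32 attention heads, head dimension 128; these numbers are only for the sketch):

```python
# Back-of-the-envelope k/v cache size per token.
layers, heads, head_dim = 32, 32, 128
elems_per_token = 2 * layers * heads * head_dim  # one k and one v per layer
fp16_bytes = elems_per_token * 2                 # 2 bytes per fp16 element
int8_bytes = elems_per_token * 1                 # 1 byte per int8 element
print(fp16_bytes / 2**20, int8_bytes / 2**20)    # 0.5 MB vs 0.25 MB per token
```

Halving the per-token cache roughly doubles the number of concurrent sessions that fit in a fixed amount of GPU memory.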
First, run the quantization script; the quantization parameters are written to the weight directory produced by `deploy.py`.

```shell
# --symmetry: symmetric vs. asymmetric quantization, defaults to True
# --offload:  keep the model on CPU and load modules onto the GPU only at
#             inference time, defaults to False
# --num_tp:   number of GPUs used for tensor parallelism; keep it consistent
#             with deploy.py
python3 -m lmdeploy.lite.apis.kv_qparams \
  --model $HF_MODEL \
  --output_dir $DEPLOY_WEIGHT_DIR \
  --symmetry True \
  --offload False \
  --num_tp 1
```

Then adjust `config.ini` (a scripted example follows the list):

- change `use_context_fmha` to 0, which disables context FMHA
- set `quant_policy` to 4. It defaults to 0, meaning the quantization is disabled
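
For example, the two edits can be scripted with a few lines of Python. The `workspace/triton_models/weights/config.ini` path is an assumption about where `deploy.py` writes the config; adjust it to your own workspace layout:

```python
# A sketch of the two config.ini edits listed above.
import configparser

cfg_path = "workspace/triton_models/weights/config.ini"  # assumed location
cfg = configparser.ConfigParser()
cfg.read(cfg_path)
section = cfg.sections()[0]                # the engine settings live in one section
cfg.set(section, "use_context_fmha", "0")  # 0 disables context FMHA
cfg.set(section, "quant_policy", "4")      # 4 enables kv_cache int8 (default 0 = off)
with open(cfg_path, "w") as f:
    cfg.write(f)
```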

Here are the [quantization test results](./docs/zh_cn/quantization.md).

## Contributing

We appreciate all the contributors for their efforts to improve LMDeploy. Please refer to the [contributing guidelines](.github/CONTRIBUTING.md) for how to participate in the project.

## Acknowledgements

- [FasterTransformer](https://github.com/NVIDIA/FasterTransformer)

## License

This project is released under the [Apache 2.0 license](LICENSE).