<div align="center">
  <img src="resources/lmdeploy-logo.png" width="450"/>

[![docs](https://img.shields.io/badge/docs-latest-blue)](https://lmdeploy.readthedocs.io/en/latest/)
[![codecov](https://codecov.io/gh/open-mmlab/lmdeploy/branch/main/graph/badge.svg)](https://codecov.io/gh/open-mmlab/lmdeploy)
[![license](https://img.shields.io/github/license/open-mmlab/lmdeploy.svg)](https://github.com/open-mmlab/lmdeploy/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/open-mmlab/lmdeploy)](https://github.com/open-mmlab/lmdeploy/issues)
[![open issues](https://img.shields.io/github/issues-raw/open-mmlab/lmdeploy)](https://github.com/open-mmlab/lmdeploy/issues)

English | [简体中文](README_zh-CN.md)

</div>

<div align="center">
  <a href="https://openmmlab.medium.com/" style="text-decoration:none;">
    <img src="https://user-images.githubusercontent.com/25839884/219255827-67c1a27f-f8c5-46a9-811d-5e57448c61d1.png" width="3%" alt="" /></a>
  <img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
  <a href="https://discord.com/channels/1037617289144569886/1046608014234370059" style="text-decoration:none;">
    <img src="https://user-images.githubusercontent.com/25839884/218347213-c080267f-cbb6-443e-8532-8e1ed9a58ea9.png" width="3%" alt="" /></a>
  <img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
  <a href="https://twitter.com/OpenMMLab" style="text-decoration:none;">
    <img src="https://user-images.githubusercontent.com/25839884/218346637-d30c8a0f-3eba-4699-8131-512fb06d46db.png" width="3%" alt="" /></a>
  <img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
  <a href="https://www.youtube.com/openmmlab" style="text-decoration:none;">
    <img src="https://user-images.githubusercontent.com/25839884/218346691-ceb2116a-465a-40af-8424-9f30d2348ca9.png" width="3%" alt="" /></a>
  <img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
  <a href="https://space.bilibili.com/1293512903" style="text-decoration:none;">
    <img src="https://user-images.githubusercontent.com/25839884/219026751-d7d14cce-a7c9-4e82-9942-8375fca65b99.png" width="3%" alt="" /></a>
  <img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
  <a href="https://www.zhihu.com/people/openmmlab" style="text-decoration:none;">
    <img src="https://user-images.githubusercontent.com/25839884/219026120-ba71e48b-6e94-4bd4-b4e9-b7d175b5e362.png" width="3%" alt="" /></a>
</div>

## Introduction

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams. It has the following core features:

- **Efficient Inference Engine (TurboMind)**: Based on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), we have implemented TurboMind, an efficient inference engine that supports LLaMA and its variants on NVIDIA GPUs.

- **Interactive Inference Mode**: By caching the attention k/v during multi-round dialogues, the engine remembers the dialogue history and avoids re-processing historical sessions on every turn.

<div align="center">
  <img src="https://github.com/NVIDIA/FasterTransformer/blob/main/docs/images/gpt/gpt_interactive_generation.2.png?raw=true"/>
</div>

- **Multi-GPU Model Deployment and Quantization**: We provide comprehensive support for model deployment and quantization, validated on models ranging from 7B to 100B parameters.

- **Persistent Batch Inference**: Incoming requests are batched continuously as they arrive, further improving model execution efficiency.

![PersistentBatchInference](https://github.com/open-mmlab/lmdeploy/assets/25839884/8f8b57b8-42af-4b71-ad74-e75f39b10694)

## Quick Start

### Installation

Below are quick steps for installation:

```shell
conda create -n open-mmlab python=3.8
conda activate open-mmlab
git clone https://github.com/open-mmlab/lmdeploy.git
cd lmdeploy
pip install -e .
```
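
After installation, a quick sanity check that the package is importable:

```shell
python3 -c "import lmdeploy"
```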

### Build

Pull the docker image `openmmlab/lmdeploy:latest` and build the lmdeploy libraries inside a container launched from it.
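
As a reference, such a container can be started along these lines (the `--gpus` flag and the mount path are illustrative assumptions, not a prescribed setup):

```shell
docker run --gpus all --rm -it \
    -v $(pwd):/workspace openmmlab/lmdeploy:latest bash
```

Then, inside the container: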

```shell
mkdir build && cd build
../generate.sh
make -j$(nproc) && make install
```
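
`make install` places the TurboMind libraries under `build/install/backends/turbomind`, which is the path the serving commands below pass via `--lib-dir`.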

### Serving [LLaMA](https://github.com/facebookresearch/llama)

Weights for the LLaMA models can be obtained by filling out [this form](https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform?usp=send_form).

Run one of the following commands to serve a LLaMA model on an NVIDIA GPU server:

<details close>
<summary><b>7B</b></summary>

```shell
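# Convert the model weights into TurboMind's format; deploy.py writes its output under ./workspace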
python3 lmdeploy/serve/turbomind/deploy.py llama-7B /path/to/llama-7b llama \
    --tokenizer_path /path/to/tokenizer/model
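# Start the serving container, pointing it at the TurboMind libs built earlier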
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/turbomind
```

</details>

<details close>
<summary><b>13B</b></summary>

```shell
python3 lmdeploy/serve/turbomind/deploy.py llama-13B /path/to/llama-13b llama \
    --tokenizer_path /path/to/tokenizer/model --tp 2
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/turbomind
```

</details>

### Serving [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/)

<details open>
<summary><b>7B</b></summary>

```shell
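# Recover the Vicuna weights by applying the released delta to the base LLaMA weights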
python3 -m pip install fschat
python3 -m fastchat.model.apply_delta \
  --base-model-path /path/to/llama-7b \
  --target-model-path /path/to/vicuna-7b \
  --delta-path lmsys/vicuna-7b-delta-v1.1

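# 'hf' indicates that the recovered weights are in the huggingface format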
python3 lmdeploy/serve/turbomind/deploy.py vicuna-7B /path/to/vicuna-7b hf
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/turbomind
```

</details>

<details>
<summary><b>13B</b></summary>

```shell
python3 -m pip install fschat
python3 -m fastchat.model.apply_delta \
  --base-model-path /path/to/llama-13b \
  --target-model-path /path/to/vicuna-13b \
  --delta-path lmsys/vicuna-13b-delta-v1.1

python3 lmdeploy/serve/turbomind/deploy.py vicuna-13B /path/to/vicuna-13b hf
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/turbomind
```

</details>

## Inference with Command Line Interface

```shell
python3 lmdeploy/serve/client.py {server_ip_address}:33337
```
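
Here `{server_ip_address}` is the address of the machine running `service_docker_up.sh`, and `33337` is the port used by the serving commands above.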

## Inference with Web UI

```shell
python3 lmdeploy/app.py {server_ip_address}:33337 {model_name}
```

## User Guide

## Quantization

In fp16 mode, int8 quantization of the kv_cache can be enabled so that a single card can serve more users.
First run the quantization script; the resulting quantization parameters are stored in the weight directory produced by `deploy.py`.

```shell
# --symmetry: use symmetric (True) or asymmetric (False) quantization
# --offload:  offload some modules to the CPU to save GPU memory
# --num_tp:   the number of GPUs used for tensor parallelism
python3 -m lmdeploy.lite.apis.kv_qparams \
  --model $HF_MODEL \
  --output_dir $DEPLOY_WEIGHT_DIR \
  --symmetry True \
  --offload False \
  --num_tp 1
```

Then adjust `config.ini`:

- set `use_context_fmha` to 0, which turns it off
- set `quant_policy` to 4 (the default is 0, meaning kv_cache quantization is disabled)
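
Concretely, the two entries in the generated `config.ini` would then read as follows (only the affected keys are shown):

```ini
use_context_fmha = 0
quant_policy = 4
```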

See the [quantization test results](./docs/zh_cn/quantization.md).

## Contributing

We appreciate all contributions to LMDeploy. Please refer to [CONTRIBUTING.md](.github/CONTRIBUTING.md) for the contributing guidelines.

## Acknowledgement

- [FasterTransformer](https://github.com/NVIDIA/FasterTransformer)

## License

This project is released under the [Apache 2.0 license](LICENSE).