<div align="center">
  <img src="resources/lmdeploy-logo.png" width="450"/>

[![docs](https://img.shields.io/badge/docs-latest-blue)](https://lmdeploy.readthedocs.io/en/latest/)
[![badge](https://github.com/InternLM/lmdeploy/workflows/lint/badge.svg)](https://github.com/InternLM/lmdeploy/actions)
[![PyPI](https://img.shields.io/pypi/v/lmdeploy)](https://pypi.org/project/lmdeploy)
[![license](https://img.shields.io/github/license/InternLM/lmdeploy.svg)](https://github.com/InternLM/lmdeploy/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/InternLM/lmdeploy)](https://github.com/InternLM/lmdeploy/issues)
[![open issues](https://img.shields.io/github/issues-raw/InternLM/lmdeploy)](https://github.com/InternLM/lmdeploy/issues)

English | [简体中文](README_zh-CN.md)

</div>

<p align="center">
    👋 join us on <a href="https://twitter.com/intern_lm" target="_blank">Twitter</a>, <a href="https://discord.gg/xa29JuW87d" target="_blank">Discord</a> and <a href="https://r.vansin.top/?r=internwx" target="_blank">WeChat</a>
</p>

______________________________________________________________________

## News 🎉

- \[2023/08\] TurboMind supports Qwen-7B, dynamic NTK-RoPE scaling and dynamic logN scaling
- \[2023/08\] TurboMind supports Windows (tp=1)
- \[2023/08\] TurboMind supports 4-bit inference, 2.4x faster than FP16, the fastest among open-source implementations 🚀. Check [this guide](./docs/en/w4a16.md) for details
- \[2023/08\] LMDeploy has launched on the [HuggingFace Hub](https://huggingface.co/lmdeploy), providing ready-to-use 4-bit models.
- \[2023/08\] LMDeploy supports 4-bit quantization using the [AWQ](https://arxiv.org/abs/2306.00978) algorithm.
- \[2023/07\] TurboMind supports Llama-2 70B with GQA.
- \[2023/07\] TurboMind supports Llama-2 7B/13B.
- \[2023/07\] TurboMind supports tensor-parallel inference of InternLM.

______________________________________________________________________

## Introduction

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams. It has the following core features:

- **Efficient Inference Engine (TurboMind)**: Based on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), we have implemented an efficient inference engine - TurboMind, which supports the inference of LLaMA and its variant models on NVIDIA GPUs.

- **Interactive Inference Mode**: By caching the attention k/v during multi-turn dialogue, the engine remembers dialogue history and avoids re-processing historical sessions.

- **Multi-GPU Model Deployment and Quantization**: We provide comprehensive model deployment and quantization support, validated at different model scales.

- **Persistent Batch Inference**: Further optimization of model execution efficiency.

![PersistentBatchInference](https://github.com/InternLM/lmdeploy/assets/67539920/e3876167-0671-44fc-ac52-5a0f9382493e)

## Supported Models

`LMDeploy` has two inference backends, `Pytorch` and `TurboMind`.

### TurboMind

> **Note**<br />
> W4A16 inference requires an NVIDIA GPU with Ampere architecture or above.

|  Models  | Tensor Parallel | FP16 | KV INT8 | W4A16 | W8A8 |
| :------: | :-------------: | :--: | :-----: | :---: | :--: |
|  Llama   |       Yes       | Yes  |   Yes   |  Yes  |  No  |
|  Llama2  |       Yes       | Yes  |   Yes   |  Yes  |  No  |
| InternLM |       Yes       | Yes  |   Yes   |  Yes  |  No  |
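
If you are unsure whether a machine satisfies the Ampere requirement noted above, a quick (non-lmdeploy) check is to list the GPU names with `nvidia-smi` and verify they are Ampere-or-newer parts:

```shell
# Ampere-or-newer cards (e.g. A100, A10, RTX 30xx/40xx) support W4A16 inference
nvidia-smi --query-gpu=name --format=csv,noheader
```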

### Pytorch

|  Models  | Tensor Parallel | FP16 | KV INT8 | W4A16 | W8A8 |
| :------: | :-------------: | :--: | :-----: | :---: | :--: |
|  Llama   |       Yes       | Yes  |   No    |  No   |  No  |
|  Llama2  |       Yes       | Yes  |   No    |  No   |  No  |
| InternLM |       Yes       | Yes  |   No    |  No   |  No  |

## Performance

**Case I**: output token throughput with fixed numbers of input and output tokens (1 and 2048, respectively)

**Case II**: request throughput with real conversation data

Test setting: LLaMA-7B, NVIDIA A100 (80G)

The output token throughput of TurboMind exceeds 2000 tokens/s, which is about 5% - 15% higher than DeepSpeed overall and outperforms Hugging Face Transformers by up to 2.3x.
The request throughput of TurboMind is also 30% higher than vLLM's.

![benchmark](https://github.com/InternLM/lmdeploy/assets/4560679/7775c518-608e-4e5b-be73-7645a444e774)

## Quick Start

### Installation

Install lmdeploy with pip (Python 3.8+) or [from source](./docs/en/build.md):

```shell
pip install lmdeploy
```
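
As a quick sanity check of the installation (optional; this assumes `python3` points at the environment you installed into), try importing the package:

```shell
python3 -c "import lmdeploy; print(getattr(lmdeploy, '__version__', 'installed'))"
```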

### Deploy InternLM

#### Get InternLM model

```shell
# 1. Download the InternLM model

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/internlm/internlm-chat-7b /path/to/internlm-chat-7b

# If you want to clone without large files (just their pointers),
# prepend the git clone command with the following env var:
# GIT_LFS_SKIP_SMUDGE=1

# 2. Convert the InternLM model to turbomind's format, which will be saved in "./workspace" by default
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b
```
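
A quick look at the converted workspace can confirm the conversion succeeded; the entries below are the ones referenced later in this README, though the exact layout may vary between lmdeploy versions:

```shell
ls ./workspace
# Expected to include, among others:
#   service_docker_up.sh      # launches the Triton inference server (see "Serving with Triton Inference Server")
#   triton_models/weights/    # converted weights; also the target directory for KV Cache quantization parameters
```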

#### Inference with TurboMind

```shell
python -m lmdeploy.turbomind.chat ./workspace
```

> **Note**<br />
> When inferring with FP16 precision, the InternLM-7B model requires at least 15.7 GB of GPU memory on TurboMind. <br />
> It is recommended to use NVIDIA cards such as the 3090, V100, A100, etc. <br />
> Disabling GPU ECC can free up about 10% of memory; try `sudo nvidia-smi --ecc-config=0` and reboot the system.

> **Note**<br />
> Tensor parallelism is available for inference on multiple GPUs. Add `--tp=<num_gpu>` to the `chat` command to enable runtime TP.
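
For example, a two-GPU run with runtime TP (a sketch based on the note above) looks like:

```shell
python -m lmdeploy.turbomind.chat ./workspace --tp=2
```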

#### Serving with gradio

```shell
python3 -m lmdeploy.serve.gradio.app ./workspace
```

![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)

#### Serving with Triton Inference Server

Launch the inference server by:

```shell
bash workspace/service_docker_up.sh
```

Then, you can communicate with the inference server from the command line:

```shell
python3 -m lmdeploy.serve.client {server_ip_address}:33337
```

or via the web UI:

```shell
python3 -m lmdeploy.serve.gradio.app {server_ip_address}:33337
```

For deploying other supported models, such as LLaMA, LLaMA-2, Vicuna, and so on, see the guide [here](docs/en/serving.md).

### Inference with PyTorch

For detailed instructions on inference with PyTorch models, see [here](docs/en/pytorch.md).

#### Single GPU

```shell
python3 -m lmdeploy.pytorch.chat $NAME_OR_PATH_TO_HF_MODEL \
    --max_new_tokens 64 \
    --temperature 0.8 \
    --top_p 0.95 \
    --seed 0
```

#### Tensor Parallel with DeepSpeed

```shell
deepspeed --module --num_gpus 2 lmdeploy.pytorch.chat \
    $NAME_OR_PATH_TO_HF_MODEL \
    --max_new_tokens 64 \
    --temperature 0.8 \
    --top_p 0.95 \
    --seed 0
```

You need to install DeepSpeed first to use this feature:

```shell
pip install deepspeed
```

## Quantization

### Step 1. Obtain Quantization Parameters

First, run the quantization script to obtain the quantization parameters.

> After execution, the various parameters needed for quantization will be stored in `$WORK_DIR`; these will be used in the following steps.

```shell
# --calib_dataset: calibration dataset; c4, ptb, wikitext2 and pileval are supported
# --calib_samples: number of samples in the calibration set; reduce it if GPU memory is insufficient
# --calib_seqlen:  length of a single text sample; reduce it if GPU memory is insufficient
# --work_dir:      folder storing the Pytorch-format quantization statistics and post-quantization weights
python3 -m lmdeploy.lite.apis.calibrate \
  --model $HF_MODEL \
  --calib_dataset 'c4' \
  --calib_samples 128 \
  --calib_seqlen 2048 \
  --work_dir $WORK_DIR
```

### Step 2. Actual Model Quantization

`LMDeploy` supports INT4 quantization of weights and INT8 quantization of KV Cache. Run the corresponding script according to your needs.

#### Weight INT4 Quantization

LMDeploy uses the AWQ algorithm for model weight quantization.

> This step requires the `$WORK_DIR` produced in Step 1; the quantized weights will also be stored in this folder.

```shell
# --w_bits:       bit width for weight quantization
# --w_group_size: group size for weight quantization statistics
# --work_dir:     directory holding the quantization parameters from Step 1
python3 -m lmdeploy.lite.apis.auto_awq \
  --model $HF_MODEL \
  --w_bits 4 \
  --w_group_size 128 \
  --work_dir $WORK_DIR
```

#### KV Cache INT8 Quantization

In FP16 mode, KV Cache INT8 quantization can be enabled, so a single card can serve more users.
First, execute the quantization script. The quantization parameters are stored in the `workspace/triton_models/weights` directory produced by `deploy.py`.

```shell
# --work_dir:      directory holding the quantization parameters from Step 1
# --turbomind_dir: directory of the converted turbomind weights, e.g. workspace/triton_models/weights from the deploy step
# --kv_sym:        whether to use symmetric (True) or asymmetric (False) quantization
# --num_tp:        number of GPUs used for tensor parallelism
python3 -m lmdeploy.lite.apis.kv_qparams \
  --work_dir $WORK_DIR \
  --turbomind_dir $TURBOMIND_DIR \
  --kv_sym False \
  --num_tp 1
```

Then adjust `workspace/triton_models/weights/config.ini` as shown below:

- `use_context_fmha` changed to 0, meaning context FMHA is disabled
- `quant_policy` set to 4. This parameter defaults to 0, which means quantization is not enabled
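
The edited section of `config.ini` would then contain the following (only the two keys above are shown; all other fields written by `deploy.py` remain unchanged):

```ini
use_context_fmha = 0
quant_policy = 4
```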

Quantization test results are available [here](./docs/en/kv_int8.md).

> **Warning**<br />
> Runtime tensor parallelism is not available for quantized models. Please set `--tp` on `deploy` to enable static TP.
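
For instance, a sketch of enabling static TP on two GPUs by adding the `--tp` flag mentioned above to the deploy command from the Quick Start:

```shell
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b --tp 2
```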

## Contributing

We appreciate all contributions to LMDeploy. Please refer to [CONTRIBUTING.md](.github/CONTRIBUTING.md) for the contributing guideline.

## Acknowledgement

- [FasterTransformer](https://github.com/NVIDIA/FasterTransformer)
- [llm-awq](https://github.com/mit-han-lab/llm-awq)

## License

This project is released under the [Apache 2.0 license](LICENSE).