<div align="center">
  <img src="resources/lmdeploy-logo.svg" width="450"/>

[![docs](https://img.shields.io/badge/docs-latest-blue)](https://lmdeploy.readthedocs.io/en/latest/)
[![badge](https://github.com/InternLM/lmdeploy/workflows/lint/badge.svg)](https://github.com/InternLM/lmdeploy/actions)
[![PyPI](https://img.shields.io/pypi/v/lmdeploy)](https://pypi.org/project/lmdeploy)
[![license](https://img.shields.io/github/license/InternLM/lmdeploy.svg)](https://github.com/InternLM/lmdeploy/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/InternLM/lmdeploy)](https://github.com/InternLM/lmdeploy/issues)
[![open issues](https://img.shields.io/github/issues-raw/InternLM/lmdeploy)](https://github.com/InternLM/lmdeploy/issues)

English | [简体中文](README_zh-CN.md)

</div>

<p align="center">
    👋 join us on <a href="https://twitter.com/intern_lm" target="_blank">Twitter</a>, <a href="https://discord.gg/xa29JuW87d" target="_blank">Discord</a> and <a href="https://r.vansin.top/?r=internwx" target="_blank">WeChat</a>
</p>

______________________________________________________________________

## News 🎉

- \[2023/11\] TurboMind supports loading HF models directly. Click [here](./docs/en/load_hf.md) for details.
- \[2023/11\] TurboMind major upgrades, including: Paged Attention, faster attention kernels without sequence length limitation, 2x faster KV8 kernels, Split-K decoding (Flash Decoding), and W4A16 inference for sm_75
- \[2023/09\] TurboMind supports Qwen-14B
- \[2023/09\] TurboMind supports InternLM-20B
- \[2023/09\] TurboMind supports all features of Code Llama: code completion, infilling, chat / instruct, and Python specialist. Click [here](./docs/en/supported_models/codellama.md) for the deployment guide
- \[2023/09\] TurboMind supports Baichuan2-7B
- \[2023/08\] TurboMind supports flash-attention2.
- \[2023/08\] TurboMind supports Qwen-7B, dynamic NTK-RoPE scaling and dynamic logN scaling
- \[2023/08\] TurboMind supports Windows (tp=1)
- \[2023/08\] TurboMind supports 4-bit inference, 2.4x faster than FP16, the fastest open-source implementation🚀. Check [this](./docs/en/w4a16.md) guide for detailed info
- \[2023/08\] LMDeploy has launched on the [HuggingFace Hub](https://huggingface.co/lmdeploy), providing ready-to-use 4-bit models.
- \[2023/08\] LMDeploy supports 4-bit quantization using the [AWQ](https://arxiv.org/abs/2306.00978) algorithm.
- \[2023/07\] TurboMind supports Llama-2 70B with GQA.
- \[2023/07\] TurboMind supports Llama-2 7B/13B.
- \[2023/07\] TurboMind supports tensor-parallel inference of InternLM.

______________________________________________________________________

## Introduction

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams. It has the following core features:

- **Efficient Inference Engine (TurboMind)**: Based on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), we have implemented an efficient inference engine, TurboMind, which supports inference of LLaMA and its variants on NVIDIA GPUs.

- **Interactive Inference Mode**: By caching the attention k/v during multi-turn dialogues, the engine remembers the dialogue history and avoids reprocessing historical sessions.

- **Multi-GPU Model Deployment and Quantization**: We provide comprehensive model deployment and quantization support, validated at different scales.

- **Persistent Batch Inference**: Further optimization of model execution efficiency.

![PersistentBatchInference](https://github.com/InternLM/lmdeploy/assets/67539920/e3876167-0671-44fc-ac52-5a0f9382493e)

## Supported Models

`LMDeploy` has two inference backends, `PyTorch` and `TurboMind`. You can run `lmdeploy list` to check the supported model names.

### TurboMind

> **Note**<br />
> W4A16 inference requires an NVIDIA GPU with Ampere architecture or above.

|    Models    | Tensor Parallel | FP16 | KV INT8 | W4A16 | W8A8 |
| :----------: | :-------------: | :--: | :-----: | :---: | :--: |
|    Llama     |       Yes       | Yes  |   Yes   |  Yes  |  No  |
|    Llama2    |       Yes       | Yes  |   Yes   |  Yes  |  No  |
|    SOLAR     |       Yes       | Yes  |   Yes   |  Yes  |  No  |
| InternLM-7B  |       Yes       | Yes  |   Yes   |  Yes  |  No  |
| InternLM-20B |       Yes       | Yes  |   Yes   |  Yes  |  No  |
|   QWen-7B    |       Yes       | Yes  |   Yes   |  Yes  |  No  |
|   QWen-14B   |       Yes       | Yes  |   Yes   |  Yes  |  No  |
| Baichuan-7B  |       Yes       | Yes  |   Yes   |  Yes  |  No  |
| Baichuan2-7B |       Yes       | Yes  |   Yes   |  Yes  |  No  |
|  Code Llama  |       Yes       | Yes  |   No    |  No   |  No  |

### PyTorch

|   Models    | Tensor Parallel | FP16 | KV INT8 | W4A16 | W8A8 |
| :---------: | :-------------: | :--: | :-----: | :---: | :--: |
|    Llama    |       Yes       | Yes  |   No    |  No   |  No  |
|   Llama2    |       Yes       | Yes  |   No    |  No   |  No  |
| InternLM-7B |       Yes       | Yes  |   No    |  No   |  No  |

## Performance

**Case I**: output token throughput with fixed numbers of input and output tokens (1 and 2048, respectively)

**Case II**: request throughput with real conversation data

Test Setting: LLaMA-7B, NVIDIA A100 (80G)

The output token throughput of TurboMind exceeds 2000 tokens/s, which is about 5% - 15% higher than DeepSpeed overall and up to 2.3x that of huggingface transformers.
The request throughput of TurboMind is 30% higher than vLLM's.

![benchmark](https://github.com/InternLM/lmdeploy/assets/4560679/7775c518-608e-4e5b-be73-7645a444e774)

## Quick Start

### Installation

Install lmdeploy with pip (Python 3.8+) or [from source](./docs/en/build.md):

```shell
pip install lmdeploy
```

> **Note**<br />
> `pip install lmdeploy` installs only the packages required at runtime. If you want to run code from modules like `lmdeploy.lite` and `lmdeploy.serve`, install the corresponding extra packages.
> For instance, running `pip install lmdeploy[lite]` installs the extra dependencies for the `lmdeploy.lite` module.
>
> - `all`: Install lmdeploy with all dependencies in `requirements.txt`
> - `lite`: Install lmdeploy with extra dependencies in `requirements/lite.txt`
> - `serve`: Install lmdeploy with dependencies in `requirements/serve.txt`

### Deploy InternLM

To use the TurboMind inference engine, you need to first convert the model into TurboMind format. Currently, both online and offline conversion are supported. With online conversion, TurboMind can load the Huggingface model directly, while with offline conversion you need to save the converted model before using it.
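
For offline conversion, a minimal sketch is shown below. It assumes the `lmdeploy convert` subcommand and its default `./workspace` output directory; the exact arguments may differ by version, so follow [load_hf.md](docs/en/load_hf.md) for the authoritative steps.

```shell
# Hedged sketch of offline conversion (arguments are assumptions; see docs/en/load_hf.md)
# 1. Convert the HuggingFace weights into TurboMind format (written to ./workspace by default)
lmdeploy convert internlm-chat-7b /path/to/internlm-chat-7b

# 2. Chat with the converted model directory instead of the HF repo id
lmdeploy chat turbomind ./workspace
```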

The following uses [internlm/internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) as an example to show how to use TurboMind with online conversion. You can refer to [load_hf.md](docs/en/load_hf.md) for other methods.

#### Inference by TurboMind

```shell
lmdeploy chat turbomind internlm/internlm-chat-7b --model-name internlm-chat-7b
```

> **Note**<br /> The internlm/internlm-chat-7b model will be downloaded to the `.cache` folder. You can also use a local path here.

> **Note**<br />
> When inferring with FP16 precision, the InternLM-7B model requires at least 15.7G of GPU memory on TurboMind. <br />
> It is recommended to use NVIDIA cards such as the 3090, V100, A100, etc. <br />
> Disabling GPU ECC can free up 10% of memory; try `sudo nvidia-smi --ecc-config=0` and reboot the system.

> **Note**<br />
> Tensor parallelism is available to perform inference on multiple GPUs. Add `--tp=<num_gpu>` to the `chat` command to enable runtime TP, as shown below.
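
For example, a minimal sketch that runs the same chat session across two GPUs (only the `--tp` flag changes from the single-GPU command above):

```shell
# Runtime tensor parallelism over 2 GPUs
lmdeploy chat turbomind internlm/internlm-chat-7b --model-name internlm-chat-7b --tp 2
```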

#### Serving with Gradio

```shell
# install lmdeploy with extra dependencies
pip install lmdeploy[serve]

lmdeploy serve gradio internlm/internlm-chat-7b --model-name internlm-chat-7b
```

![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)

#### Serving with RESTful API

Launch the inference server by:

```shell
# install lmdeploy with extra dependencies
pip install lmdeploy[serve]

lmdeploy serve api_server internlm/internlm-chat-7b --model-name internlm-chat-7b --instance_num 32 --tp 1
```

Then, you can communicate with it from the command line:

```shell
# api_server_url is the URL printed by the api_server, e.g. http://localhost:23333
lmdeploy serve api_client api_server_url
```

or via the web UI:

```shell
# api_server_url is the URL printed by the api_server, e.g. http://localhost:23333
# server_ip and server_port here are for gradio ui
# example: lmdeploy serve gradio http://localhost:23333 --server_name localhost --server_port 6006
lmdeploy serve gradio api_server_url --server_name ${gradio_ui_ip} --server_port ${gradio_ui_port}
```
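
You can also query the server over plain HTTP. The sketch below assumes an OpenAI-style `/v1/chat/completions` route and request schema, which may not match your installed version:

```shell
# Hedged sketch: call the api_server with curl (endpoint path and JSON fields are assumptions)
curl http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "internlm-chat-7b",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```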

Refer to [restful_api.md](docs/en/restful_api.md) for more details.

### Inference with PyTorch

For detailed instructions on inference with PyTorch models, see [here](docs/en/pytorch.md).

#### Single GPU

```shell
lmdeploy chat torch $NAME_OR_PATH_TO_HF_MODEL \
    --max_new_tokens 64 \
    --temperature 0.8 \
    --top_p 0.95 \
    --seed 0
```

#### Tensor Parallel with DeepSpeed

```shell
deepspeed --module --num_gpus 2 lmdeploy.pytorch.chat \
    $NAME_OR_PATH_TO_HF_MODEL \
    --max_new_tokens 64 \
    --temperature 0.8 \
    --top_p 0.95 \
    --seed 0
```

You need to install `deepspeed` first to use this feature:

```shell
pip install deepspeed
```

## Quantization

#### Weight INT4 Quantization

LMDeploy uses the [AWQ](https://arxiv.org/abs/2306.00978) algorithm for model weight quantization.

[Click here](./docs/en/w4a16.md) to view the usage and test results for weight INT4 quantization.
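
As a rough guide, a minimal sketch of the W4A16 workflow is shown below. It assumes the `lmdeploy lite auto_awq` subcommand, and the flag names are illustrative only; treat [w4a16.md](./docs/en/w4a16.md) as the source of truth.

```shell
# Hedged sketch: quantize weights to 4-bit with AWQ (flag names are assumptions; see docs/en/w4a16.md)
lmdeploy lite auto_awq internlm/internlm-chat-7b \
  --w_bits 4 \
  --w_group_size 128 \
  --work_dir ./internlm-chat-7b-w4
# The quantized weights in --work_dir can then be converted and served with
# TurboMind following the linked guide.
```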

#### KV Cache INT8 Quantization

[Click here](./docs/en/kv_int8.md) to view the usage, implementation formula, and test results for KV INT8 quantization.
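
As a rough outline, the KV INT8 flow is calibrate-then-export. The sketch below is hedged: the `lmdeploy lite calibrate` and `lmdeploy lite kv_qparams` subcommands and all flag names are assumptions based on the linked guide and may differ in your version.

```shell
# Hedged sketch of KV Cache INT8 quantization (subcommands/flags are assumptions; see docs/en/kv_int8.md)
# 1. Collect activation statistics on a calibration dataset
lmdeploy lite calibrate \
  --model internlm/internlm-chat-7b \
  --calib_dataset c4 \
  --calib_samples 128 \
  --calib_seqlen 2048 \
  --work_dir ./calib-output

# 2. Export KV scale/zero-point parameters into the TurboMind weights directory
lmdeploy lite kv_qparams \
  --work_dir ./calib-output \
  --turbomind_dir ./workspace/triton_models/weights \
  --kv_sym False \
  --num_tp 1
```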

> **Warning**<br />
> Runtime tensor parallelism for quantized models is not available. Please set `--tp` on `deploy` to enable static TP.

## Contributing

We appreciate all contributions to LMDeploy. Please refer to [CONTRIBUTING.md](.github/CONTRIBUTING.md) for the contributing guideline.

## Acknowledgement

- [FasterTransformer](https://github.com/NVIDIA/FasterTransformer)
- [llm-awq](https://github.com/mit-han-lab/llm-awq)

## License

This project is released under the [Apache 2.0 license](LICENSE).