"tests/unit_tests/test_optimizer_cpu_offloading.py" did not exist on "a02a5490baab3b4745844b0f0752fe746a0cb7bc"
README.md 8.53 KB
Newer Older
lvhan028's avatar
lvhan028 committed
1
<div align="center">
  <img src="resources/lmdeploy-logo.png" width="450"/>

[![docs](https://img.shields.io/badge/docs-latest-blue)](https://lmdeploy.readthedocs.io/en/latest/)
[![badge](https://github.com/InternLM/lmdeploy/workflows/lint/badge.svg)](https://github.com/InternLM/lmdeploy/actions)
[![PyPI](https://img.shields.io/pypi/v/lmdeploy)](https://pypi.org/project/lmdeploy)
[![license](https://img.shields.io/github/license/InternLM/lmdeploy.svg)](https://github.com/InternLM/lmdeploy/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/InternLM/lmdeploy)](https://github.com/InternLM/lmdeploy/issues)
[![open issues](https://img.shields.io/github/issues-raw/InternLM/lmdeploy)](https://github.com/InternLM/lmdeploy/issues)

English | [简体中文](README_zh-CN.md)

</div>

<p align="center">
    👋 join us on <a href="https://twitter.com/intern_lm" target="_blank">Twitter</a>, <a href="https://discord.gg/xa29JuW87d" target="_blank">Discord</a> and <a href="https://r.vansin.top/?r=internwx" target="_blank">WeChat</a>
</p>

______________________________________________________________________

## News 🎉

- \[2023/08\] TurboMind supports FlashAttention-2.
- \[2023/08\] TurboMind supports Qwen-7B, dynamic NTK-RoPE scaling, and dynamic logN scaling.
- \[2023/08\] TurboMind supports Windows (tp=1).
- \[2023/08\] TurboMind supports 4-bit inference, 2.4x faster than FP16, the fastest open-source implementation 🚀. Check [this](./docs/en/w4a16.md) guide for detailed info.
- \[2023/08\] LMDeploy has launched on the [HuggingFace Hub](https://huggingface.co/lmdeploy), providing ready-to-use 4-bit models.
- \[2023/08\] LMDeploy supports 4-bit quantization using the [AWQ](https://arxiv.org/abs/2306.00978) algorithm.
- \[2023/07\] TurboMind supports Llama-2 70B with GQA.
- \[2023/07\] TurboMind supports Llama-2 7B/13B.
- \[2023/07\] TurboMind supports tensor-parallel inference of InternLM.

______________________________________________________________________

## Introduction

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams. It has the following core features:

- **Efficient Inference Engine (TurboMind)**: Based on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), we have implemented an efficient inference engine, TurboMind, which supports inference of LLaMA and its variant models on NVIDIA GPUs.

- **Interactive Inference Mode**: By caching the attention k/v during multi-round dialogues, the engine remembers dialogue history and avoids repeated processing of historical sessions.

- **Multi-GPU Model Deployment and Quantization**: We provide comprehensive model deployment and quantization support, validated at different model scales.

- **Persistent Batch Inference**: Further optimization of model execution efficiency.

![PersistentBatchInference](https://github.com/InternLM/lmdeploy/assets/67539920/e3876167-0671-44fc-ac52-5a0f9382493e)

## Supported Models

`LMDeploy` has two inference backends: `PyTorch` and `TurboMind`.

### TurboMind

> **Note**<br />
> W4A16 inference requires an NVIDIA GPU with Ampere architecture or newer.

|  Models  | Tensor Parallel | FP16 | KV INT8 | W4A16 | W8A8 |
| :------: | :-------------: | :--: | :-----: | :---: | :--: |
|  Llama   |       Yes       | Yes  |   Yes   |  Yes  |  No  |
|  Llama2  |       Yes       | Yes  |   Yes   |  Yes  |  No  |
| InternLM |       Yes       | Yes  |   Yes   |  Yes  |  No  |

### PyTorch

|  Models  | Tensor Parallel | FP16 | KV INT8 | W4A16 | W8A8 |
| :------: | :-------------: | :--: | :-----: | :---: | :--: |
|  Llama   |       Yes       | Yes  |   No    |  No   |  No  |
|  Llama2  |       Yes       | Yes  |   No    |  No   |  No  |
| InternLM |       Yes       | Yes  |   No    |  No   |  No  |

## Performance

**Case I**: output token throughput with fixed input and output token numbers (1 input token, 2048 output tokens)

**Case II**: request throughput with real conversation data

Test Setting: LLaMA-7B, NVIDIA A100 (80G)

The output token throughput of TurboMind exceeds 2000 tokens/s, which is about 5% - 15% higher than DeepSpeed overall and up to 2.3x that of huggingface transformers.
Its request throughput is 30% higher than vLLM's.

![benchmark](https://github.com/InternLM/lmdeploy/assets/4560679/7775c518-608e-4e5b-be73-7645a444e774)

## Quick Start

### Installation

Install lmdeploy with pip (Python 3.8+) or [from source](./docs/en/build.md):

```shell
pip install lmdeploy
```
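
To verify the installation, import the package and print its version (a quick sanity check; it assumes `lmdeploy.__version__` follows the usual Python packaging convention):

```shell
python3 -c "import lmdeploy; print(lmdeploy.__version__)"
```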

### Deploy InternLM

#### Get InternLM model

```shell
# 1. Download InternLM model

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/internlm/internlm-chat-7b /path/to/internlm-chat-7b

# if you want to clone without large files – just their pointers
# prepend your git clone with the following env var:
GIT_LFS_SKIP_SMUDGE=1

# 2. Convert InternLM model to turbomind's format, which will be in "./workspace" by default
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b
```
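
If the conversion succeeds, a `workspace` directory appears under the current working directory, holding the TurboMind model files along with `service_docker_up.sh`, the Triton serving script used later in this README (the exact layout may vary across versions):

```shell
# inspect the converted model artifacts
ls ./workspace
```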

#### Inference by TurboMind

```shell
python -m lmdeploy.turbomind.chat ./workspace
```

> **Note**<br />
> When inferring with FP16 precision, the InternLM-7B model requires at least 15.7 GB of GPU memory on TurboMind. <br />
> It is recommended to use NVIDIA cards such as the 3090, V100, or A100.
> Disabling GPU ECC can free up 10% of memory; try `sudo nvidia-smi --ecc-config=0` and reboot the system.

> **Note**<br />
> Tensor parallelism is available for inference on multiple GPUs. Add `--tp=<num_gpu>` to the `chat` command to enable runtime TP.
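
For example, to run the chat CLI above across two GPUs (a usage sketch of the `--tp` option from the note; adjust the GPU count to your setup):

```shell
# interactive chat with 2-way tensor parallelism at runtime
python -m lmdeploy.turbomind.chat ./workspace --tp=2
```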

#### Serving with Gradio

```shell
python3 -m lmdeploy.serve.gradio.app ./workspace
```

![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)

#### Serving with RESTful API

Launch the inference server with:

```shell
python3 -m lmdeploy.serve.openai.api_server ./workspace server_ip server_port --instance_num 32 --tp 1
```
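
Here `server_ip` and `server_port` are where the API server listens; going by the flag names, `--instance_num` sets the number of concurrent inference instances and `--tp` the tensor-parallel degree (an informal reading; see [restful_api.md](docs/en/restful_api.md) for the authoritative descriptions).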

Then, you can communicate with it from the command line,

```shell
# restful_api_url is the URL printed by api_server.py, e.g. http://localhost:23333
python -m lmdeploy.serve.openai.api_client restful_api_url
```

or through the web UI,

```shell
# restful_api_url is the URL printed by api_server.py, e.g. http://localhost:23333
# server_ip and server_port here are for the gradio UI
# example: python -m lmdeploy.serve.gradio.app http://localhost:23333 localhost 6006 --restful_api True
python -m lmdeploy.serve.gradio.app restful_api_url server_ip server_port --restful_api True
```

Refer to [restful_api.md](docs/en/restful_api.md) for more details.
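
As a quick sanity check that the server is up, you can query its self-hosted API documentation. This sketch assumes the server runs on `http://localhost:23333` and, being FastAPI-based, exposes the standard OpenAPI routes (an assumption; see [restful_api.md](docs/en/restful_api.md) for the actual endpoints):

```shell
# fetch the OpenAPI schema, or open http://localhost:23333/docs in a browser
curl http://localhost:23333/openapi.json
```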

#### Serving with Triton Inference Server

Launch the inference server with:

```shell
bash workspace/service_docker_up.sh
```

Then, you can communicate with the inference server from the command line,

```shell
python3 -m lmdeploy.serve.client {server_ip_address}:33337
```

or through the web UI,

```shell
python3 -m lmdeploy.serve.gradio.app {server_ip_address}:33337
```

For the deployment of other supported models, such as LLaMA, LLaMA-2, Vicuna, and so on, you can find the guide [here](docs/en/serving.md).

### Inference with PyTorch

For detailed instructions on inference with PyTorch models, see [here](docs/en/pytorch.md).

#### Single GPU

```shell
python3 -m lmdeploy.pytorch.chat $NAME_OR_PATH_TO_HF_MODEL \
    --max_new_tokens 64 \
    --temperature 0.8 \
    --top_p 0.95 \
    --seed 0
```
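
Here `$NAME_OR_PATH_TO_HF_MODEL` is either a HuggingFace Hub model ID or a local directory; for example, using the InternLM model from the earlier steps:

```shell
# a hub ID ...
NAME_OR_PATH_TO_HF_MODEL=internlm/internlm-chat-7b
# ... or the local clone from the "Get InternLM model" step
NAME_OR_PATH_TO_HF_MODEL=/path/to/internlm-chat-7b
```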

#### Tensor Parallel with DeepSpeed

```shell
deepspeed --module --num_gpus 2 lmdeploy.pytorch.chat \
    $NAME_OR_PATH_TO_HF_MODEL \
    --max_new_tokens 64 \
    --temperature 0.8 \
    --top_p 0.95 \
    --seed 0
```

You need to install DeepSpeed first to use this feature:

```shell
pip install deepspeed
```

## Quantization

#### Weight INT4 Quantization

LMDeploy uses the [AWQ](https://arxiv.org/abs/2306.00978) algorithm for model weight quantization.

[Click here](./docs/en/w4a16.md) to view the test results for weight INT4 usage.
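
As a rough sketch of what an AWQ quantization run looks like (the module path `lmdeploy.lite.apis.auto_awq` and the flags below are assumptions about the lite toolkit; consult the guide above for the authoritative commands):

```shell
# hypothetical invocation: quantize HF model weights to 4 bit with AWQ.
# w_bits is the target weight bit width; w_group_size is the per-group
# quantization granularity. Flag names are assumptions; check the guide.
python3 -m lmdeploy.lite.apis.auto_awq $NAME_OR_PATH_TO_HF_MODEL \
    --w_bits 4 --w_group_size 128 --work_dir ./quant_workspace
```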

#### KV Cache INT8 Quantization

[Click here](./docs/en/kv_int8.md) to view the usage, implementation formula, and test results for KV INT8.
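
For intuition, KV INT8 schemes typically map the cached k/v values to int8 via an affine transform derived from observed min/max statistics; a schematic of the usual formulation (not necessarily the exact one in the guide):

```
zp      = (min + max) / 2
scale   = (max - min) / 255
quant:   q = round((x - zp) / scale)
dequant: x = q * scale + zp
```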

> **Warning**<br />
> Runtime tensor parallelism for quantized models is not available. Please set `--tp` on `deploy` to enable static TP.

## Contributing

We appreciate all contributions to LMDeploy. Please refer to [CONTRIBUTING.md](.github/CONTRIBUTING.md) for the contributing guidelines.

## Acknowledgement

- [FasterTransformer](https://github.com/NVIDIA/FasterTransformer)
- [llm-awq](https://github.com/mit-han-lab/llm-awq)

## License

This project is released under the [Apache 2.0 license](LICENSE).