<div align="center">
  <img src="resources/lmdeploy-logo.png" width="450"/>

English | [简体中文](README_zh-CN.md)

</div>

<p align="center">
    👋 join us on <a href="https://twitter.com/intern_lm" target="_blank">Twitter</a>, <a href="https://discord.gg/xa29JuW87d" target="_blank">Discord</a> and <a href="https://r.vansin.top/?r=internwx" target="_blank">WeChat</a>
</p>

______________________________________________________________________

## News 🎉

- \[2023/08\] TurboMind supports 4-bit quantization and inference.
- \[2023/07\] TurboMind supports Llama-2 70B with GQA.
- \[2023/07\] TurboMind supports Llama-2 7B/13B.
- \[2023/07\] TurboMind supports tensor-parallel inference of InternLM.

______________________________________________________________________

## Introduction

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams. It has the following core features:

- **Efficient Inference Engine (TurboMind)**: Based on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), we have implemented an efficient inference engine, TurboMind, which supports inference of LLaMA and its variant models on NVIDIA GPUs.

- **Interactive Inference Mode**: By caching the attention k/v during multi-round dialogue, the engine remembers dialogue history and avoids repeated processing of historical sessions.

- **Multi-GPU Model Deployment and Quantization**: We provide comprehensive model deployment and quantization support, validated on models of different scales.

- **Persistent Batch Inference**: Further optimizes model execution efficiency.

![PersistentBatchInference](https://github.com/InternLM/lmdeploy/assets/67539920/e3876167-0671-44fc-ac52-5a0f9382493e)

## Performance

**Case I**: output token throughput with a fixed number of input and output tokens (1 and 2048, respectively)

**Case II**: request throughput with real conversation data

Test setting: LLaMA-7B, NVIDIA A100 (80G)

The output token throughput of TurboMind exceeds 2000 tokens/s, which is about 5% - 15% higher than DeepSpeed overall and up to 2.3x that of HuggingFace Transformers.
Its request throughput is 30% higher than vLLM's.

![benchmark](https://github.com/InternLM/lmdeploy/assets/4560679/7775c518-608e-4e5b-be73-7645a444e774)

## Quick Start

### Installation

Install lmdeploy with pip (Python 3.8+) or [from source](./docs/en/build.md):

```shell
pip install lmdeploy
```

### Deploy InternLM

#### Get InternLM model

```shell
# 1. Download InternLM model

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/internlm/internlm-chat-7b /path/to/internlm-chat-7b

# if you want to clone without large files – just their pointers
# prepend your git clone with the following env var:
GIT_LFS_SKIP_SMUDGE=1

# 2. Convert InternLM model to turbomind's format, which will be in "./workspace" by default
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b

```

#### Inference by TurboMind
82

```shell
python -m lmdeploy.turbomind.chat ./workspace
```

> **Note**<br />
> When inferring with FP16 precision, the InternLM-7B model requires at least 15.7 GB of GPU memory on TurboMind. <br />
> It is recommended to use NVIDIA cards such as the 3090, V100, A100, etc.
> Disabling GPU ECC can free up about 10% of memory; try `sudo nvidia-smi --ecc-config=0` and reboot the system.

> **Note**<br />
> Tensor parallelism is available for inference on multiple GPUs. Add `--tp=<num_gpu>` to the `chat` command to enable runtime TP, as sketched below.

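For example, a minimal sketch of enabling runtime tensor parallelism for the chat command (the GPU count of 2 is only an assumption for illustration):

```shell
# Run interactive chat with the converted model split across 2 GPUs
python -m lmdeploy.turbomind.chat ./workspace --tp=2
```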
#### Serving with gradio

```shell
python3 -m lmdeploy.serve.gradio.app ./workspace
```

![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)

#### Serving with Triton Inference Server

Launch the inference server with:

```shell
bash workspace/service_docker_up.sh
```

Then, you can communicate with the inference server from the command line,

```shell
python3 -m lmdeploy.serve.client {server_ip_address}:33337
```

or through the web UI,

```shell
python3 -m lmdeploy.serve.gradio.app {server_ip_address}:33337
```

For the deployment of other supported models, such as LLaMA, LLaMA-2, Vicuna and so on, you can find the guide [here](docs/en/serving.md); a sketch for Llama-2 follows below.

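As a sketch, deploying a Llama-2 chat checkpoint follows the same pattern as the InternLM example above (the model name `llama2` and the path are assumptions for illustration; consult the guide for the exact model names that are supported):

```shell
# Hypothetical example: convert a Llama-2 chat model to TurboMind's format (output goes to ./workspace by default)
python3 -m lmdeploy.serve.turbomind.deploy llama2 /path/to/llama-2-chat-hf
```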
### Inference with PyTorch

For detailed instructions on inference with PyTorch models, see [here](docs/en/pytorch.md).

#### Single GPU

```shell
python3 -m lmdeploy.pytorch.chat $NAME_OR_PATH_TO_HF_MODEL \
    --max_new_tokens 64 \
    --temperature 0.8 \
    --top_p 0.95 \
    --seed 0
```

#### Tensor Parallel with DeepSpeed

```shell
deepspeed --module --num_gpus 2 lmdeploy.pytorch.chat \
    $NAME_OR_PATH_TO_HF_MODEL \
    --max_new_tokens 64 \
    --temperature 0.8 \
    --top_p 0.95 \
    --seed 0
```

You need to install DeepSpeed first to use this feature:

```shell
pip install deepspeed
```

## Quantization

### Step 1. Obtain Quantization Parameters

First, run the quantization script to obtain the quantization parameters.

> After execution, the various parameters needed for quantization will be stored in `$WORK_DIR`; they will be used in the following steps.

```shell
# --calib_dataset: calibration dataset; supports c4, ptb, wikitext2 and pileval
# --calib_samples: number of samples in the calibration set; reduce it if GPU memory is insufficient
# --calib_seqlen: length of a single piece of text; reduce it if GPU memory is insufficient
# --work_dir: folder storing the PyTorch-format quantization statistics and post-quantization weights
python3 -m lmdeploy.lite.apis.calibrate \
  --model $HF_MODEL \
  --calib_dataset 'c4' \
  --calib_samples 128 \
  --calib_seqlen 2048 \
  --work_dir $WORK_DIR
```

### Step 2. Actual Model Quantization

`LMDeploy` supports INT4 quantization of weights and INT8 quantization of KV Cache. Run the corresponding script according to your needs.

#### Weight INT4 Quantization

LMDeploy uses the AWQ algorithm for model weight quantization.

> This step requires the `$WORK_DIR` produced in Step 1; the quantized weights will also be stored in that folder.

```shell
# --w_bits: bit width for weight quantization
# --w_sym: whether to use symmetric quantization for weights
# --w_group_size: group size for weight quantization statistics
# --work_dir: directory holding the quantization parameters from Step 1
python3 -m lmdeploy.lite.apis.auto_awq \
  --w_bits 4 \
  --w_sym False \
  --w_group_size 128 \
  --work_dir $WORK_DIR
```

#### KV Cache INT8 Quantization

In FP16 mode, KV cache INT8 quantization can be enabled so that a single card can serve more users.
First, execute the quantization script; the quantization parameters are stored in the `workspace/triton_models/weights` directory generated by `deploy.py`.

```shell
# --work_dir: directory holding the quantization parameters from Step 1
# --turbomind_dir: the TurboMind weight directory, i.e. workspace/triton_models/weights
# --kv_sym: whether to use symmetric (True) or asymmetric (False) quantization
# --num_tp: the number of GPUs used for tensor parallelism
python3 -m lmdeploy.lite.apis.kv_qparams \
  --work_dir $WORK_DIR \
  --turbomind_dir $TURBOMIND_DIR \
  --kv_sym False \
  --num_tp 1
```

Then adjust `workspace/triton_models/weights/config.ini` as follows (a `sed` sketch is given after the list):

- Change `use_context_fmha` to 0, which turns it off
- Set `quant_policy` to 4. This parameter defaults to 0, which means KV cache quantization is not enabled

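As a minimal sketch, the two settings can be changed in place with `sed`, assuming the keys currently read `use_context_fmha = 1` and `quant_policy = 0` in `config.ini` (edit the file manually if the layout differs):

```shell
# Turn off context FMHA and enable the INT8 KV cache quantization policy
sed -i 's/use_context_fmha = 1/use_context_fmha = 0/' workspace/triton_models/weights/config.ini
sed -i 's/quant_policy = 0/quant_policy = 4/' workspace/triton_models/weights/config.ini
```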
Quantization test results are available [here](./docs/en/quantization.md).

> **Warning**<br />
> Runtime tensor parallelism is not available for quantized models. Please set `--tp` on `deploy` to enable static TP, as sketched below.

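For instance, a sketch of building a 2-GPU workspace with static TP at deploy time (the GPU count and model path are assumptions for illustration):

```shell
# Static tensor parallelism is fixed when the workspace is created
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b --tp=2
```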
## Contributing

We appreciate all contributions to LMDeploy. Please refer to [CONTRIBUTING.md](.github/CONTRIBUTING.md) for the contributing guideline.

## Acknowledgement

- [FasterTransformer](https://github.com/NVIDIA/FasterTransformer)
- [llm-awq](https://github.com/mit-han-lab/llm-awq)

## License

This project is released under the [Apache 2.0 license](LICENSE).