<div align="center">
  <img src="resources/lmdeploy-logo.png" width="450"/>

English | [简体中文](README_zh-CN.md)

</div>

<p align="center">
    👋 join us on <a href="https://twitter.com/intern_lm" target="_blank">Twitter</a>, <a href="https://discord.gg/xa29JuW87d" target="_blank">Discord</a> and <a href="https://r.vansin.top/?r=internwx" target="_blank">WeChat</a>
</p>

______________________________________________________________________

## News 🎉

- \[2023/07\] TurboMind supports Llama-2 70B with GQA.
- \[2023/07\] TurboMind supports Llama-2 7B/13B.
- \[2023/07\] TurboMind supports tensor-parallel inference of InternLM.

______________________________________________________________________

## Introduction

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams. It has the following core features:

- **Efficient Inference Engine (TurboMind)**: Based on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), we have implemented an efficient inference engine, TurboMind, which supports inference of LLaMA and its variant models on NVIDIA GPUs.

- **Interactive Inference Mode**: By caching the attention k/v during multi-round dialogues, the engine remembers the dialogue history and avoids reprocessing historical sessions.

- **Multi-GPU Model Deployment and Quantization**: We provide comprehensive model deployment and quantization support, which has been validated at different scales.

- **Persistent Batch Inference**: Further optimization of model execution efficiency.

![PersistentBatchInference](https://github.com/InternLM/lmdeploy/assets/67539920/e3876167-0671-44fc-ac52-5a0f9382493e)

## Performance

**Case I**: output token throughput with a fixed number of input and output tokens (1 input token, 2048 output tokens)

**Case II**: request throughput with real conversation data

Test setting: LLaMA-7B, NVIDIA A100 (80G)

The output token throughput of TurboMind exceeds 2000 tokens/s, which is about 5% - 15% higher than DeepSpeed overall and outperforms HuggingFace Transformers by up to 2.3x.
The request throughput of TurboMind is 30% higher than that of vLLM.

![benchmark](https://github.com/InternLM/lmdeploy/assets/4560679/7775c518-608e-4e5b-be73-7645a444e774)

## Quick Start

### Installation

Below are quick steps for installation:

```shell
conda create -n lmdeploy python=3.10 -y
conda activate lmdeploy
pip install lmdeploy
```
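
To verify the installation, a quick sanity check is to import the package from Python (a minimal sketch; it assumes the installed package exposes a `__version__` attribute):

```shell
python3 -c "import lmdeploy; print(lmdeploy.__version__)"
```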

### Deploy InternLM

#### Get InternLM model

```shell
# 1. Download InternLM model

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/internlm/internlm-chat-7b /path/to/internlm-chat-7b

# if you want to clone without large files – just their pointers
# prepend your git clone with the following env var:
GIT_LFS_SKIP_SMUDGE=1

# 2. Convert InternLM model to turbomind's format, which will be in "./workspace" by default
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b

```
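
After the conversion, you can inspect the generated workspace. The listing below is only a sketch of what to expect (the exact layout may differ between versions), but the `triton_models` directory and `service_docker_up.sh` script referenced in the following sections should be present:

```shell
ls workspace
# expected entries include, among others:
#   service_docker_up.sh  triton_models/
```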

#### Inference by TurboMind

```shell
python -m lmdeploy.turbomind.chat ./workspace
```

```{note}
When inferring with FP16 precision, the InternLM-7B model requires at least 15.7G of GPU memory overhead on TurboMind. It is recommended to use NVIDIA cards such as 3090, V100, A100, etc.
Disabling GPU ECC can free up about 10% of GPU memory; try `sudo nvidia-smi --ecc-config=0` and reboot the system.
```
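
Before launching an inference session, it can help to confirm that enough GPU memory is free. One way is to query `nvidia-smi` (standard NVIDIA tooling, independent of LMDeploy):

```shell
nvidia-smi --query-gpu=name,memory.total,memory.used,memory.free --format=csv
```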

#### Serving

Launch the inference server with:

```shell
bash workspace/service_docker_up.sh
```

Then, you can communicate with the inference server via the command line,

```shell
python3 -m lmdeploy.serve.client {server_ip_address}:33337
```

or webui,

```shell
python3 -m lmdeploy.app {server_ip_address}:33337
```

![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)

For deploying other supported models, such as LLaMA, LLaMA-2, Vicuna, etc., please refer to the guide [here](docs/en/serving.md).

### Inference with PyTorch

You have to install DeepSpeed before running inference with PyTorch.

```shell
pip install deepspeed
```

#### Single GPU

```shell
python3 -m lmdeploy.pytorch.chat $NAME_OR_PATH_TO_HF_MODEL \
    --max_new_tokens 64 \
    --temperature 0.8 \
    --top_p 0.95 \
    --seed 0
```
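
For instance, `$NAME_OR_PATH_TO_HF_MODEL` can point to the InternLM checkpoint downloaded earlier; the command below is just the template above with that path filled in (a sketch, with the same illustrative sampling values):

```shell
python3 -m lmdeploy.pytorch.chat /path/to/internlm-chat-7b \
    --max_new_tokens 64 \
    --temperature 0.8 \
    --top_p 0.95 \
    --seed 0
```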

#### Tensor Parallel with DeepSpeed

```shell
deepspeed --module --num_gpus 2 lmdeploy.pytorch.chat \
    $NAME_OR_PATH_TO_HF_MODEL \
    --max_new_tokens 64 \
    --temperature 0.8 \
    --top_p 0.95 \
    --seed 0
```

## Quantization

In fp16 mode, kv_cache int8 quantization can be enabled so that a single card can serve more users.
First, run the quantization script. The resulting quantization parameters are stored in `workspace/triton_models/weights`, the directory generated by `deploy.py`.

```shell
python3 -m lmdeploy.lite.apis.kv_qparams \
  --model $HF_MODEL \
  --output_dir $DEPLOY_WEIGHT_DIR \
  --symmetry True \
  --offload False \
  --num_tp 1

# --symmetry: whether to use symmetric or asymmetric quantization
# --offload:  whether to offload some modules to the CPU to save GPU memory
# --num_tp:   the number of GPUs used for tensor parallelism
```
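
As a concrete example, the placeholders above can be filled with the InternLM checkpoint downloaded earlier and the weight directory produced by `deploy.py` (a sketch; verify the available options with `--help`):

```shell
python3 -m lmdeploy.lite.apis.kv_qparams \
  --model /path/to/internlm-chat-7b \
  --output_dir workspace/triton_models/weights \
  --symmetry True \
  --offload False \
  --num_tp 1
```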

Then adjust `workspace/triton_models/weights/config.ini` as follows (a command-line sketch of the changes appears after the list):

- Set `use_context_fmha` to 0, which means context FMHA is turned off
- Set `quant_policy` to 4; this parameter defaults to 0, which means quantization is not enabled
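
A minimal sketch of applying these two changes from the command line, assuming `config.ini` uses plain `key = value` lines (adjust the patterns, or edit the file by hand, if the format differs):

```shell
# enable kv_cache int8 quantization in the converted workspace
sed -i 's/^use_context_fmha = .*/use_context_fmha = 0/' workspace/triton_models/weights/config.ini
sed -i 's/^quant_policy = .*/quant_policy = 4/' workspace/triton_models/weights/config.ini
```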

The quantization test results are available [here](./docs/en/quantization.md).

## Contributing

We appreciate all contributions to LMDeploy. Please refer to [CONTRIBUTING.md](.github/CONTRIBUTING.md) for the contributing guideline.

## Acknowledgement

- [FasterTransformer](https://github.com/NVIDIA/FasterTransformer)

## License

This project is released under the [Apache 2.0 license](LICENSE).