<div align="center">
  <img src="resources/lmdeploy-logo.png" width="450"/>

English | [简体中文](README_zh-CN.md)

</div>

<p align="center">
    👋 join us on <a href="https://twitter.com/intern_lm" target="_blank">Twitter</a>, <a href="https://discord.gg/xa29JuW87d" target="_blank">Discord</a> and <a href="https://r.vansin.top/?r=internwx" target="_blank">WeChat</a>
</p>

______________________________________________________________________

## News 🎉

- \[2023/07\] TurboMind supports Llama-2 70B with GQA.
- \[2023/07\] TurboMind supports Llama-2 7B/13B.
- \[2023/07\] TurboMind supports tensor-parallel inference of InternLM.

______________________________________________________________________

## Introduction

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams. It has the following core features:

- **Efficient Inference Engine (TurboMind)**: Based on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), we have implemented TurboMind, an efficient inference engine that supports the inference of LLaMA and its variant models on NVIDIA GPUs.

- **Interactive Inference Mode**: By caching the attention k/v during multi-round dialogues, the engine remembers the dialogue history and avoids repeatedly processing the historical context.

- **Multi-GPU Model Deployment and Quantization**: We provide comprehensive model deployment and quantization support, validated on models of different scales.

- **Persistent Batch Inference**: Further optimization of model execution efficiency.

![PersistentBatchInference](https://github.com/InternLM/lmdeploy/assets/67539920/e3876167-0671-44fc-ac52-5a0f9382493e)

## Performance

**Case I**: output token throughput with fixed numbers of input and output tokens (1 and 2048, respectively)

**Case II**: request throughput with real conversation data

Test Setting: LLaMA-7B, NVIDIA A100 (80G)

The output token throughput of TurboMind exceeds 2000 tokens/s, which is about 5%-15% higher than DeepSpeed overall and outperforms HuggingFace Transformers by up to 2.3x.
The request throughput of TurboMind is 30% higher than that of vLLM.

![benchmark](https://github.com/InternLM/lmdeploy/assets/4560679/7775c518-608e-4e5b-be73-7645a444e774)

## Quick Start

### Installation

Below are quick steps for installation:

```shell
conda create -n lmdeploy python=3.10 -y
conda activate lmdeploy
git clone https://github.com/InternLM/lmdeploy.git
cd lmdeploy
pip install -e .
```
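
A quick way to confirm the editable install succeeded is to import the package (a minimal check only; it does not verify GPU or CUDA setup):

```shell
python3 -c "import lmdeploy; print(lmdeploy.__file__)"
```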

### Deploy InternLM

#### Get InternLM model

```shell
# 1. Download InternLM model

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/internlm/internlm-chat-7b /path/to/internlm-chat-7b

# to clone without the large files (only their pointers),
# prepend the git clone command with the GIT_LFS_SKIP_SMUDGE env var, e.g.
# GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/internlm/internlm-chat-7b /path/to/internlm-chat-7b

# 2. Convert InternLM model to turbomind's format, which will be in "./workspace" by default
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b

```
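
After conversion, you can sanity-check the generated workspace; the two paths below are the ones referenced later in this README:

```shell
ls workspace/                        # should contain service_docker_up.sh, used in the Serving section
ls workspace/triton_models/weights/  # converted weights, later targeted by the quantization step
```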

#### Inference by TurboMind

```shell
docker run --gpus all --rm -v $(pwd)/workspace:/workspace -it openmmlab/lmdeploy:latest \
    python3 -m lmdeploy.turbomind.chat /workspace
```
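
If your from-source build includes the TurboMind backend, the same chat interface can also be launched directly on the host instead of inside Docker (a sketch assuming the converted `./workspace` from the previous step):

```shell
python3 -m lmdeploy.turbomind.chat ./workspace
```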

```{note}
When inferring with FP16 precision, the InternLM-7B model requires at least 15.7 GB of GPU memory on TurboMind. It is recommended to use NVIDIA cards such as the 3090, V100, and A100.
Disabling GPU ECC can free up about 10% of GPU memory; try `sudo nvidia-smi --ecc-config=0` and reboot the system.
```
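
Before launching, you can compare free GPU memory against the ~15.7 GB requirement above with a standard `nvidia-smi` query (not LMDeploy-specific):

```shell
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv
```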

#### Serving

Launch the inference server with:

```shell
bash workspace/service_docker_up.sh
```
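
Once the server is up, note the host's IP address for the client step below; on Linux, for example:

```shell
hostname -I   # prints the host's IP addresses; pick one reachable from your client machine
```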

Then, you can communicate with the inference server from the command line,

```shell
python3 -m lmdeploy.serve.client {server_ip_address}:33337
```

or through the web UI,

```shell
python3 -m lmdeploy.app {server_ip_address}:33337 internlm
```

![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)

For the deployment of other supported models, such as LLaMA, LLaMA-2, Vicuna and so on, you can find the guide [here](docs/en/serving.md).

### Inference with PyTorch

You have to install deepspeed before running with PyTorch.

```shell
pip install deepspeed
```
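
You can check that DeepSpeed installed correctly and detects your CUDA environment with DeepSpeed's bundled report tool:

```shell
ds_report
```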

#### Single GPU

```shell
python3 -m lmdeploy.pytorch.chat $NAME_OR_PATH_TO_HF_MODEL \
    --max_new_tokens 64 \
    --temperature 0.8 \
    --top_p 0.95 \
    --seed 0
```
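
As the variable name suggests, `$NAME_OR_PATH_TO_HF_MODEL` can be a HuggingFace model name or a local path; for example, reusing the InternLM checkpoint cloned earlier in this README:

```shell
export NAME_OR_PATH_TO_HF_MODEL=/path/to/internlm-chat-7b
```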

#### Tensor Parallel with DeepSpeed

```shell
deepspeed --module --num_gpus 2 lmdeploy.pytorch.chat \
    $NAME_OR_PATH_TO_HF_MODEL \
    --max_new_tokens 64 \
    --temperature 0.8 \
    --top_p 0.95 \
    --seed 0
```

## Quantization

In fp16 mode, int8 quantization of the kv_cache can be enabled so that a single card can serve more users.
First, execute the quantization script; the quantization parameters will be stored in the `workspace/triton_models/weights` directory produced by `deploy.py`.

```shell
# --symmetry: whether to use symmetric or asymmetric quantization
# --offload:  whether to offload some modules to CPU to save GPU memory
# --num_tp:   the number of GPUs used for tensor parallelism
python3 -m lmdeploy.lite.apis.kv_qparams \
  --model $HF_MODEL \
  --output_dir $DEPLOY_WEIGHT_DIR \
  --symmetry True \
  --offload False \
  --num_tp 1
```

Then adjust `workspace/triton_models/weights/config.ini`:

- change `use_context_fmha` to 0, which turns it off
- set `quant_policy` to 4 (the default is 0, meaning kv_cache quantization is disabled); a shell sketch for applying both edits follows
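
A minimal sketch of making both edits from the shell, assuming the keys appear one per line in `key = value` form (editing the file by hand works just as well):

```shell
CONFIG=workspace/triton_models/weights/config.ini
sed -i 's/^use_context_fmha.*/use_context_fmha = 0/' $CONFIG
sed -i 's/^quant_policy.*/quant_policy = 4/' $CONFIG
```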

See the [quantization test results](./docs/en/quantization.md).

## Contributing

We appreciate all contributions to LMDeploy. Please refer to [CONTRIBUTING.md](.github/CONTRIBUTING.md) for the contributing guideline.

## Acknowledgement

- [FasterTransformer](https://github.com/NVIDIA/FasterTransformer)

## License

This project is released under the [Apache 2.0 license](LICENSE).