<div align="center">
  <img src="resources/lmdeploy-logo.png" width="450"/>

English | [简体中文](README_zh-CN.md)

</div>

<div align="center">
  <a href="https://openmmlab.medium.com/" style="text-decoration:none;">
    <img src="https://user-images.githubusercontent.com/25839884/219255827-67c1a27f-f8c5-46a9-811d-5e57448c61d1.png" width="3%" alt="" /></a>
  <img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
  <a href="https://discord.com/channels/1037617289144569886/1046608014234370059" style="text-decoration:none;">
    <img src="https://user-images.githubusercontent.com/25839884/218347213-c080267f-cbb6-443e-8532-8e1ed9a58ea9.png" width="3%" alt="" /></a>
  <img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
  <a href="https://twitter.com/OpenMMLab" style="text-decoration:none;">
    <img src="https://user-images.githubusercontent.com/25839884/218346637-d30c8a0f-3eba-4699-8131-512fb06d46db.png" width="3%" alt="" /></a>
  <img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
  <a href="https://www.youtube.com/openmmlab" style="text-decoration:none;">
    <img src="https://user-images.githubusercontent.com/25839884/218346691-ceb2116a-465a-40af-8424-9f30d2348ca9.png" width="3%" alt="" /></a>
  <img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
  <a href="https://space.bilibili.com/1293512903" style="text-decoration:none;">
    <img src="https://user-images.githubusercontent.com/25839884/219026751-d7d14cce-a7c9-4e82-9942-8375fca65b99.png" width="3%" alt="" /></a>
  <img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
  <a href="https://www.zhihu.com/people/openmmlab" style="text-decoration:none;">
    <img src="https://user-images.githubusercontent.com/25839884/219026120-ba71e48b-6e94-4bd4-b4e9-b7d175b5e362.png" width="3%" alt="" /></a>
</div>

## Introduction

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams. It has the following core features:

- **Efficient Inference Engine (TurboMind)**: Based on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), we have implemented an efficient inference engine, TurboMind, which supports inference of LLaMA and its variants on NVIDIA GPUs.

- **Interactive Inference Mode**: By caching the attention k/v during multi-turn dialogues, the engine remembers the dialogue history and avoids re-processing historical sessions.

- **Multi-GPU Model Deployment and Quantization**: We provide comprehensive support for model deployment and quantization, validated on models of different scales.

- **Persistent Batch Inference**: Incoming requests are continuously merged into the running batch, further improving model execution efficiency.

![PersistentBatchInference](https://github.com/InternLM/lmdeploy/assets/67539920/e3876167-0671-44fc-ac52-5a0f9382493e)

## Performance

As shown in the figure below, we compared the token generation speed of facebookresearch/llama, HuggingFace Transformers, and DeepSpeed on a 7B model.

Target Device: NVIDIA A100 (80G)

Metrics: Throughput (tokens/s)

Test Data: The number of input tokens is 1; the number of generated tokens is 2048

The throughput of TurboMind exceeds 2000 tokens/s, which is about 5%-15% higher than DeepSpeed overall and outperforms HuggingFace Transformers by up to 2.3x.

![benchmark](https://user-images.githubusercontent.com/12756472/251422522-e94a3db9-eb16-432a-8d8c-078945e7b99a.png)

## Quick Start

### Installation

Below are quick steps for installation:

```shell
conda create -n lmdeploy python=3.10
conda activate lmdeploy
git clone https://github.com/InternLM/lmdeploy.git
cd lmdeploy
pip install -e .
```
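
If the editable install completed, a quick sanity check (a minimal sketch, not an official verification step) is to import the package:

```shell
# should exit silently if lmdeploy is importable
python3 -c "import lmdeploy"
```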

### Deploy InternLM

#### Get InternLM model

```shell
# 1. Download InternLM model

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/internlm/internlm-7b /path/to/internlm-7b

# if you want to clone without large files (just their pointers),
# prepend the git clone command with the following environment variable:
GIT_LFS_SKIP_SMUDGE=1
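# i.e., the full command with the variable prepended would be:
# GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/internlm/internlm-7b /path/to/internlm-7b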

# 2. Convert InternLM model to turbomind's format, which will be in "./workspace" by default
python3 -m lmdeploy.serve.turbomind.deploy internlm-7b /path/to/internlm-7b hf

```

#### Inference by TurboMind

```shell
docker run --gpus all --rm -v $(pwd)/workspace:/workspace -it openmmlab/lmdeploy:latest \
    python3 -m lmdeploy.turbomind.chat internlm /workspace
```

```{note}
When running inference with FP16 precision, the InternLM-7B model requires at least 22.7 GB of GPU memory on TurboMind. NVIDIA cards such as the 3090, V100, or A100 are recommended.
```
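
If you are unsure whether your card has enough free memory, `nvidia-smi` (shipped with the NVIDIA driver) offers a quick check:

```shell
# print each GPU's name together with its total and currently free memory
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv
```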

#### Serving

Launch the inference server with:

```shell
bash workspace/service_docker_up.sh
```

Then, you can communicate with the inference server via the command line,

```shell
python3 -m lmdeploy.serve.client {server_ip_address}:33337 internlm
```

or via the web UI,

```shell
python3 -m lmdeploy.app {server_ip_address}:33337 internlm
```

![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)

For the deployment of other supported models, such as LLaMA and Vicuna, please refer to the guide [here](docs/en/serving.md).

### Inference with PyTorch

#### Single GPU

```shell
python3 -m lmdeploy.torch.chat $NAME_OR_PATH_TO_HF_MODEL \
    --max_new_tokens 64 \
    --temperature 0.8 \
    --top_p 0.95 \
    --seed 0
```

#### Tensor Parallel with DeepSpeed

```shell
deepspeed --module --num_gpus 2 lmdeploy.torch.chat \
    $NAME_OR_PATH_TO_HF_MODEL \
    --max_new_tokens 64 \
    --temperature 0.8 \
    --top_p 0.95 \
    --seed 0
```
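
Note that tensor-parallel inference assumes DeepSpeed is installed in the same environment, e.g. via `pip install deepspeed`.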

## Quantization

In fp16 mode, int8 quantization of the kv_cache can be enabled so that a single card can serve more users.
First run the quantization script; the resulting quantization parameters are stored in the weight directory produced by `deploy.py`.

```shell
# --symmetry: whether to use symmetric (True) or asymmetric (False) quantization
# --offload:  whether to offload some modules to the CPU to save GPU memory
# --num_tp:   the number of GPUs used for tensor parallelism
python3 -m lmdeploy.lite.apis.kv_qparams \
  --model $HF_MODEL \
  --output_dir $DEPLOY_WEIGHT_DIR \
  --symmetry True \
  --offload False \
  --num_tp 1
```

Then adjust the following settings in `config.ini` (a shell sketch of this edit is shown after the list):

- set `use_context_fmha` to 0, which turns it off
- set `quant_policy` to 4 (it defaults to 0, meaning the quantization is not enabled)
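
For example, assuming the converted weights live under `./workspace` as in the earlier steps and that `config.ini` uses `key = value` entries, the edit could look like the sketch below (verify the actual path and key names in your own workspace first):

```shell
# hypothetical location; adjust to wherever deploy.py actually wrote config.ini
CONFIG=./workspace/triton_models/weights/config.ini
sed -i 's/^use_context_fmha = .*/use_context_fmha = 0/' $CONFIG
sed -i 's/^quant_policy = .*/quant_policy = 4/' $CONFIG
```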

See the [quantization test results](./docs/zh_cn/quantization.md) for details.

## Contributing

We appreciate all contributions to LMDeploy. Please refer to [CONTRIBUTING.md](.github/CONTRIBUTING.md) for the contributing guideline.

## Acknowledgement

- [FasterTransformer](https://github.com/NVIDIA/FasterTransformer)

## License

This project is released under the [Apache 2.0 license](LICENSE).