<div align="center">
  <img src="docs/en/_static/image/lmdeploy-logo.svg" width="450"/>

[![PyPI](https://img.shields.io/pypi/v/lmdeploy)](https://pypi.org/project/lmdeploy)
![PyPI - Downloads](https://img.shields.io/pypi/dm/lmdeploy)
[![license](https://img.shields.io/github/license/InternLM/lmdeploy.svg)](https://github.com/InternLM/lmdeploy/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/InternLM/lmdeploy)](https://github.com/InternLM/lmdeploy/issues)
[![open issues](https://img.shields.io/github/issues-raw/InternLM/lmdeploy)](https://github.com/InternLM/lmdeploy/issues)

[📘Documentation](https://lmdeploy.readthedocs.io/en/latest/) |
[🛠️Quick Start](https://lmdeploy.readthedocs.io/en/latest/get_started.html) |
[🤔Reporting Issues](https://github.com/InternLM/lmdeploy/issues/new/choose)

English | [简体中文](README_zh-CN.md)

👋 Join us on [![Static Badge](https://img.shields.io/badge/-grey?style=social&logo=wechat&label=WeChat)](https://r.vansin.top/?r=internwx)
[![Static Badge](https://img.shields.io/badge/-grey?style=social&logo=twitter&label=Twitter)](https://twitter.com/intern_lm)
[![Static Badge](https://img.shields.io/badge/-grey?style=social&logo=discord&label=Discord)](https://discord.gg/xa29JuW87d)

</div>

______________________________________________________________________

## Latest News 🎉

<details open>
<summary><b>2024</b></summary>

- \[2024/03\] Support VLM offline inference pipeline and serving.
- \[2024/02\] Support Qwen 1.5, Gemma, Mistral, Mixtral, DeepSeek-MoE, and more.
- \[2024/01\] [OpenAOE](https://github.com/InternLM/OpenAOE) seamlessly integrates with the [LMDeploy Serving Service](./docs/en/serving/api_server.md).
- \[2024/01\] Support multi-model, multi-machine, multi-card inference services. For usage instructions, please refer to [here](./docs/en/serving/proxy_server.md).
- \[2024/01\] Support the [PyTorch inference engine](./docs/en/inference/pytorch.md), developed entirely in Python, which lowers the barrier for developers and enables rapid experimentation with new features and technologies.

</details>

<details close>
<summary><b>2023</b></summary>

- \[2023/12\] TurboMind supports multimodal input. [Gradio Demo](./examples/vl/README.md)
- \[2023/11\] TurboMind supports loading HF models directly. Click [here](docs/en/inference/load_hf.md) for details.
- \[2023/11\] TurboMind major upgrades, including: Paged Attention, faster attention kernels without sequence length limitation, 2x faster KV8 kernels, Split-K decoding (Flash Decoding), and W4A16 inference for sm_75
- \[2023/09\] TurboMind supports Qwen-14B
- \[2023/09\] TurboMind supports InternLM-20B
- \[2023/09\] TurboMind supports all features of Code Llama: code completion, infilling, chat / instruct, and python specialist. Click [here](./docs/en/supported_models/codellama.md) for deployment guide
- \[2023/09\] TurboMind supports Baichuan2-7B
- \[2023/08\] TurboMind supports FlashAttention-2.
- \[2023/08\] TurboMind supports Qwen-7B, dynamic NTK-RoPE scaling and dynamic logN scaling
- \[2023/08\] TurboMind supports Windows (tp=1)
- \[2023/08\] TurboMind supports 4-bit inference, 2.4x faster than FP16, the fastest open-source implementation. Check [this](docs/en/quantization/w4a16.md) guide for detailed info
- \[2023/08\] LMDeploy has launched on the [HuggingFace Hub](https://huggingface.co/lmdeploy), providing ready-to-use 4-bit models.
- \[2023/08\] LMDeploy supports 4-bit quantization using the [AWQ](https://arxiv.org/abs/2306.00978) algorithm.
- \[2023/07\] TurboMind supports Llama-2 70B with GQA.
- \[2023/07\] TurboMind supports Llama-2 7B/13B.
- \[2023/07\] TurboMind supports tensor-parallel inference of InternLM.

</details>

______________________________________________________________________

# Introduction

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams. It has the following core features:

- **Efficient Inference**: LMDeploy delivers up to 1.8x higher request throughput than vLLM, by introducing key features like persistent batch (a.k.a. continuous batching), blocked KV cache, dynamic split & fuse, tensor parallelism, and high-performance CUDA kernels (see the configuration sketch after this list).

- **Effective Quantization**: LMDeploy supports weight-only and k/v quantization, and the 4-bit inference performance is 2.4x higher than FP16. The quantization quality has been confirmed via OpenCompass evaluation.

- **Effortless Distributed Serving**: Leveraging the request distribution service, LMDeploy makes it easy and efficient to deploy multi-model services across multiple machines and GPU cards.

- **Interactive Inference Mode**: By caching the attention k/v during multi-turn dialogues, the engine remembers the dialogue history and avoids re-processing earlier turns.
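
Several of these features surface as engine configuration options. Below is a minimal sketch of a tensor-parallel pipeline with a quantized k/v cache; the field names `tp` and `quant_policy` (and the value 4 for the 8-bit k/v cache) follow the v0.2-era TurboMind config and are assumptions to verify against your installed version.

```python
# A sketch, not a definitive reference: `tp` and `quant_policy` follow the
# v0.2-era TurbomindEngineConfig; verify the fields in your installed version.
from lmdeploy import pipeline, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(
    tp=2,            # shard the model across 2 GPUs (tensor parallelism)
    quant_policy=4,  # assumed to enable the 8-bit k/v cache in this release line
)
pipe = pipeline("internlm/internlm-chat-7b", backend_config=backend_config)
print(pipe(["Hi, pls intro yourself"]))
```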

# Performance

![v0 1 0-benchmark](https://github.com/InternLM/lmdeploy/assets/4560679/8e455cf1-a792-4fa8-91a2-75df96a2a5ba)

For detailed inference benchmarks on more devices and under more settings, please refer to the following:

- [A100](./docs/en/benchmark/a100_fp16.md)
- V100
- 4090
- 3090
- 2080

# Supported Models

|       Model        |    Size    |
| :----------------: | :--------: |
|       Llama        |  7B - 65B  |
|       Llama2       |  7B - 70B  |
|      InternLM      |  7B - 20B  |
|     InternLM2      |  7B - 20B  |
| InternLM-XComposer |     7B     |
|        Qwen        |  7B - 72B  |
|      Qwen1.5       | 0.5B - 72B |
|      Qwen-VL       |     7B     |
|      Baichuan      |  7B - 13B  |
|     Baichuan2      |  7B - 13B  |
|     Code Llama     |  7B - 34B  |
|      ChatGLM2      |     6B     |
|       Falcon       | 7B - 180B  |
|         Yi         |  6B - 34B  |
|      Mistral       |     7B     |
|    DeepSeek-MoE    |    16B     |
|      Mixtral       |    8x7B    |
|       Gemma        |  2B - 7B   |

LMDeploy has developed two inference engines - [TurboMind](./docs/en/inference/turbomind.md) and [PyTorch](./docs/en/inference/pytorch.md), each with a different focus. The former strives for ultimate optimization of inference performance, while the latter, developed purely in Python, aims to decrease the barriers for developers.

They differ in the models they support and the inference data types they handle. Please refer to [this table](./docs/en/supported_models/supported_models.md) for each engine's capabilities, and choose the one that best fits your needs.
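
Switching engines is a matter of passing a different backend config to the same `pipeline` API. A minimal sketch, assuming the v0.2-era config class names:

```python
# Sketch of selecting the inference engine; PytorchEngineConfig and
# TurbomindEngineConfig follow the v0.2-era API and may change.
from lmdeploy import pipeline, PytorchEngineConfig

# Swap in TurbomindEngineConfig() to use the performance-focused TurboMind engine.
pipe = pipeline("internlm/internlm-chat-7b",
                backend_config=PytorchEngineConfig())
print(pipe(["Shanghai is"]))
```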

# Quick Start

## Installation

Install lmdeploy with pip (Python 3.8+) or [from source](./docs/en/build.md):

```shell
pip install lmdeploy
```

The default prebuilt package is compiled with CUDA 11.8. If you require CUDA 12+, install lmdeploy by:

```shell
export LMDEPLOY_VERSION=0.2.0
export PYTHON_VERSION=38
pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl
```
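
To sanity-check which build landed in your environment, print the package version (assuming `lmdeploy.__version__` is exposed, as in recent releases):

```shell
python -c "import lmdeploy; print(lmdeploy.__version__)"
```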

## Offline Batch Inference

```python
import lmdeploy
pipe = lmdeploy.pipeline("internlm/internlm-chat-7b")
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
```

> \[!NOTE\]
> By default, LMDeploy downloads model from HuggingFace. If you would like to use models from ModelScope, please install ModelScope by `pip install modelscope` and set the environment variable:
>
> `export LMDEPLOY_USE_MODELSCOPE=True`
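
For instance, a minimal sketch of routing downloads through ModelScope; the model id below is illustrative, so verify the actual repository name on ModelScope:

```python
import os

# Set the variable before the model is resolved; doing it prior to
# pipeline creation is the safe order.
os.environ["LMDEPLOY_USE_MODELSCOPE"] = "True"

import lmdeploy

# Illustrative ModelScope model id; check the real repo name on ModelScope.
pipe = lmdeploy.pipeline("Shanghai_AI_Laboratory/internlm-chat-7b")
print(pipe(["Hi, pls intro yourself"]))
```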

For more information about the inference pipeline, please refer to [here](./docs/en/inference/pipeline.md).
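
The pipeline also accepts sampling controls. A hedged sketch, assuming the v0.2-era `GenerationConfig` class and `gen_config` keyword:

```python
from lmdeploy import pipeline, GenerationConfig

pipe = pipeline("internlm/internlm-chat-7b")

# Field names assumed from the v0.2-era API; verify against your version.
gen_config = GenerationConfig(
    top_p=0.8,
    temperature=0.7,
    max_new_tokens=256,
)
response = pipe(["Shanghai is"], gen_config=gen_config)
print(response)
```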

# Tutorials

Please review the [getting_started](./docs/en/get_started.md) section for the basic usage of LMDeploy.

For detailed user guides and advanced guides, please refer to our [tutorials](https://lmdeploy.readthedocs.io/en/latest/):

- User Guide
  - [LLM Inference pipeline](./docs/en/inference/pipeline.md)
  - [VLM Inference pipeline](./docs/en/inference/vl_pipeline.md)
  - [LLM Serving](docs/en/serving/api_server.md) (a minimal launch sketch follows this list)
  - [VLM Serving](docs/en/serving/api_server_vl.md)
  - [Quantization](docs/en/quantization)
- Advanced Guide
  - [Inference Engine - TurboMind](docs/en/inference/turbomind.md)
  - [Inference Engine - PyTorch](docs/en/inference/pytorch.md)
  - [Customize chat templates](docs/en/advance/chat_template.md)
  - [Add a new model](docs/en/advance/pytorch_new_model.md)
  - GEMM tuning
  - [Long context inference](docs/en/advance/long_context.md)
  - [Multi-model inference service](docs/en/serving/proxy_server.md)
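
As a quick taste of the serving workflow covered by the guides above, here is a minimal sketch: launch the OpenAI-compatible `api_server`, then query it with `curl`. The port and the `"model"` value in the request body are assumptions; list the served names via `GET /v1/models` first.

```shell
# Terminal 1: serve a model (port chosen here for illustration)
lmdeploy serve api_server internlm/internlm-chat-7b --server-port 23333

# Terminal 2: query the OpenAI-compatible endpoint; the "model" value is an
# assumption, so check GET /v1/models for the name the server reports.
curl http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "internlm-chat-7b", "messages": [{"role": "user", "content": "Hi"}]}'
```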

# Third-party Projects

- Deploying LLMs offline on the NVIDIA Jetson platform with LMDeploy: [LMDeploy-Jetson](https://github.com/BestAnHongjun/LMDeploy-Jetson)

## Contributing

We appreciate all contributions to LMDeploy. Please refer to [CONTRIBUTING.md](.github/CONTRIBUTING.md) for the contributing guidelines.

## Acknowledgement

- [FasterTransformer](https://github.com/NVIDIA/FasterTransformer)
- [llm-awq](https://github.com/mit-han-lab/llm-awq)
- [vLLM](https://github.com/vllm-project/vllm)
- [DeepSpeed-MII](https://github.com/microsoft/DeepSpeed-MII)

## License

This project is released under the [Apache 2.0 license](LICENSE).