👋 join us on <a href="https://twitter.com/intern_lm" target="_blank">Twitter</a>, <a href="https://discord.gg/xa29JuW87d" target="_blank">Discord</a> and <a href="https://r.vansin.top/?r=internwx" target="_blank">WeChat</a>
- \[2024/03\] Support VLM offline inference pipeline and serving.
- \[2024/02\] Support Qwen 1.5, Gemma, Mistral, Mixtral, DeepSeek-MoE and so on.
- \[2024/01\] [OpenAOE](https://github.com/InternLM/OpenAOE) is seamlessly integrated with [LMDeploy Serving Service](./docs/en/serving/api_server.md).
- \[2024/01\] Support for multi-model, multi-machine, multi-card inference services. For usage instructions, please refer to [here](./docs/en/serving/proxy_server.md).
- \[2024/01\] Support [PyTorch inference engine](./docs/en/inference/pytorch.md), developed entirely in Python, helping to lower the barriers for developers and enable rapid experimentation with new features and technologies.
</details>
<details close>
<summary><b>2023</b></summary>
- \[2023/11\] TurboMind supports loading HF models directly. Click [here](./docs/en/load_hf.md) for details.
- \[2023/08\] TurboMind supports 4-bit inference, 2.4x faster than FP16, the fastest open-source implementation 🚀. Check [this](./docs/en/w4a16.md) guide for detailed info.
- \[2023/08\] LMDeploy has launched on the [HuggingFace Hub](https://huggingface.co/lmdeploy), providing ready-to-use 4-bit models.
- \[2023/08\] LMDeploy supports 4-bit quantization using the [AWQ](https://arxiv.org/abs/2306.00978) algorithm.
- \[2023/07\] TurboMind supports Llama-2 70B with GQA.
- \[2023/07\] TurboMind supports Llama-2 7B/13B.
- \[2023/07\] TurboMind supports tensor-parallel inference of InternLM.
LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams. It has the following core features:
- **Efficient Inference Engine (TurboMind)**: Based on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), we have implemented an efficient inference engine, TurboMind, which supports the inference of LLaMA and its variant models on NVIDIA GPUs.
- **Multi-GPU Model Deployment and Quantization**: We provide comprehensive model deployment and quantization support, which has been validated at different scales.
- **Persistent Batch Inference**: Further optimization of model execution efficiency.
`LMDeploy` has two inference backends, `PyTorch` and `TurboMind`. You can run `lmdeploy list` to check the supported model names.
- **Efficient Inference**: LMDeploy delivers up to 1.8x higher request throughput than vLLM, by introducing key features like persistent batch (a.k.a. continuous batching), blocked KV cache, dynamic split & fuse, tensor parallelism, high-performance CUDA kernels and so on.
- **Effective Quantization**: LMDeploy supports weight-only and k/v quantization, and the 4-bit inference performance is 2.4x higher than FP16. The quantization quality has been confirmed via OpenCompass evaluation.

> **Note**<br />
> W4A16 inference requires an NVIDIA GPU with Ampere architecture or above.

- **Effortless Distribution Server**: Leveraging the request distribution service, LMDeploy facilitates an easy and efficient deployment of multi-model services across multiple machines and cards.
- **Interactive Inference Mode**: By caching the k/v of attention during multi-round dialogue processes, the engine remembers dialogue history, thus avoiding repetitive processing of historical sessions.
For detailed inference benchmarks on more devices and with more settings, please refer to the following:

**Case I**: output token throughput with fixed input and output token numbers (1, 2048)

- [A100](./docs/en/benchmark/a100_fp16.md)
- V100
- 4090
- 3090
- 2080
**Case II**: request throughput with real conversation data
Test Setting: LLaMA-7B, NVIDIA A100(80G)

The output token throughput of TurboMind exceeds 2000 tokens/s, which is about 5% - 15% higher than DeepSpeed overall and outperforms huggingface transformers by up to 2.3x. The request throughput of TurboMind is 30% higher than vLLM.

# Supported Models

| Model | Size |
| :----------------: | :--------: |
| Llama | 7B - 65B |
| Llama2 | 7B - 70B |
| InternLM | 7B - 20B |
| InternLM2 | 7B - 20B |
| InternLM-XComposer | 7B |
| QWen | 7B - 72B |
| QWen1.5 | 0.5B - 72B |
| QWen-VL | 7B |
| Baichuan | 7B - 13B |
| Baichuan2 | 7B - 13B |
| Code Llama | 7B - 34B |
| ChatGLM2 | 6B |
| Falcon | 7B - 180B |
| YI | 6B - 34B |
| Mistral | 7B |
| DeepSeek-MoE | 16B |
| Mixtral | 8x7B |
| Gemma | 2B-7B |
LMDeploy has developed two inference engines - [TurboMind](./docs/en/inference/turbomind.md) and [PyTorch](./docs/en/inference/pytorch.md), each with a different focus. The former strives for ultimate optimization of inference performance, while the latter, developed purely in Python, aims to decrease the barriers for developers.
They differ in the types of supported models and the inference data type. Please refer to [this table](./docs/en/supported_models/supported_models.md) for each engine's capability and choose the proper one that best fits your actual needs.
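As a sketch of how this choice surfaces in the Python API (assuming the `TurbomindEngineConfig` / `PytorchEngineConfig` classes available since lmdeploy v0.2; the model id is illustrative), the engine is selected through `backend_config`:

```python
from lmdeploy import pipeline, PytorchEngineConfig, TurbomindEngineConfig

model = 'internlm/internlm2-chat-7b'  # illustrative; pick a model supported by your engine

# TurboMind engine: tuned for peak inference performance.
pipe = pipeline(model, backend_config=TurbomindEngineConfig(tp=1))

# Or use the pure-Python PyTorch engine, which is easier to extend:
# pipe = pipeline(model, backend_config=PytorchEngineConfig(tp=1))

print(pipe(['Hello, who are you?']))
```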
# Quick Start

## Installation
Install lmdeploy with pip (Python 3.8+) or [from source](./docs/en/build.md):
```
pip install lmdeploy
```
> **Note**<br />
> `pip install lmdeploy` installs only the packages required at runtime. To run code from modules such as `lmdeploy.lite` and `lmdeploy.serve`, you need to install the corresponding extra dependencies.
> For instance, `pip install lmdeploy[lite]` installs the extra dependencies for the `lmdeploy.lite` module.
>
> - `all`: Install lmdeploy with all dependencies in `requirements.txt`
> - `lite`: Install lmdeploy with extra dependencies in `requirements/lite.txt`
> - `serve`: Install lmdeploy with dependencies in `requirements/serve.txt`
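After installation, a quick sanity check can confirm that the package is importable. This is only an illustrative snippet; it assumes lmdeploy exposes a standard `__version__` attribute.

```python
# Minimal sanity check for the installation (illustrative).
import lmdeploy

print(lmdeploy.__version__)
```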
### Deploy InternLM
To use the TurboMind inference engine, the model must first be converted into TurboMind format. Currently, both online and offline conversion are supported: with online conversion, TurboMind can load a Hugging Face model directly, while with offline conversion you need to save the converted model before using it.

The following uses [internlm/internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) as an example to show how to use TurboMind with online conversion. You can refer to [load_hf.md](docs/en/load_hf.md) for other methods.
You need to install deepspeed first to use this feature.
```
pip install deepspeed
```
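A minimal sketch of online conversion through the Python `pipeline` API is shown below. It assumes lmdeploy v0.2 or later, where `pipeline` is available and defaults to the TurboMind backend for supported models; the prompts are illustrative.

```python
from lmdeploy import pipeline

# Online conversion: pass the Hugging Face model id directly and let TurboMind
# convert it on the fly (no offline conversion step required).
pipe = pipeline('internlm/internlm-chat-7b')

# Batch of prompts; responses come back in the same order.
responses = pipe(['Hi, please introduce yourself.', 'Shanghai is'])
print(responses)
```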
> \[!NOTE\]
> By default, LMDeploy downloads model from HuggingFace. If you would like to use models from ModelScope, please install ModelScope by `pip install modelscope` and set the environment variable:
>
> `export LMDEPLOY_USE_MODELSCOPE=True`
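If you prefer to stay in Python rather than export a shell variable, the same switch can be set with `os.environ` before the model is resolved. This is a sketch; the ModelScope model id below is illustrative.

```python
import os

# Equivalent to `export LMDEPLOY_USE_MODELSCOPE=True` in the shell.
# Set it before creating the pipeline so the model is fetched from ModelScope.
os.environ['LMDEPLOY_USE_MODELSCOPE'] = 'True'

from lmdeploy import pipeline

pipe = pipeline('Shanghai_AI_Laboratory/internlm-chat-7b')  # illustrative ModelScope id
```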
For more information about the inference pipeline, please refer to [here](./docs/en/inference/pipeline.md).

## Quantization

### Weight INT4 Quantization

LMDeploy uses the [AWQ](https://arxiv.org/abs/2306.00978) algorithm for model weight quantization.
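As a sketch of how a 4-bit model can then be served, the pipeline accepts an AWQ checkpoint once the engine is told about the weight format. `model_format='awq'` is the `TurbomindEngineConfig` option used in lmdeploy v0.2+; the model id below is illustrative, e.g. one of the ready-to-use 4-bit checkpoints on the [HuggingFace Hub](https://huggingface.co/lmdeploy).

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Tell TurboMind the weights are 4-bit AWQ; they are dequantized to FP16 on the fly.
backend_config = TurbomindEngineConfig(model_format='awq')

# Illustrative model id; substitute any AWQ-quantized checkpoint you have.
pipe = pipeline('lmdeploy/llama2-chat-7b-w4', backend_config=backend_config)
print(pipe(['What does 4-bit weight quantization buy you?']))
```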
[Click here](./docs/en/w4a16.md) to view the test results for weight INT4 usage.

# Tutorials

Please refer to the [getting_started](./docs/en/get_started.md) section for the basic usage of LMDeploy.
For detailed user guides and advanced guides, please refer to our [tutorials](https://lmdeploy.readthedocs.io/en/latest/).