👋 join us on <a href="https://twitter.com/intern_lm" target="_blank">Twitter</a>, <a href="https://discord.gg/xa29JuW87d" target="_blank">Discord</a> and <a href="https://r.vansin.top/?r=internwx" target="_blank">WeChat</a>
- \[2024/03\] Support VLM offline inference pipeline and serving.
- \[2024/02\] Support Qwen 1.5, Gemma, Mistral, Mixtral, DeepSeek-MoE and so on.
- \[2024/01\] [OpenAOE](https://github.com/InternLM/OpenAOE) is seamlessly integrated with the [LMDeploy Serving Service](./docs/en/serving/api_server.md).
- \[2024/01\] Support for multi-model, multi-machine, multi-card inference services. For usage instructions, please refer to [this guide](./docs/en/serving/proxy_server.md).
- \[2024/01\] Support [PyTorch inference engine](./docs/en/inference/pytorch.md), developed entirely in Python, helping to lower the barriers for developers and enable rapid experimentation with new features and technologies.
</details>
<details close>
<summary><b>2023</b></summary>
- \[2023/11\] TurboMind supports loading HF models directly. Click [here](./docs/en/load_hf.md) for details.
- \[2023/08\] TurboMind supports 4-bit inference, 2.4x faster than FP16, the fastest open-source implementation 🚀. Check [this](./docs/en/quantization/w4a16.md) guide for detailed info.
- \[2023/08\] LMDeploy has launched on the [HuggingFace Hub](https://huggingface.co/lmdeploy), providing ready-to-use 4-bit models.
- \[2023/08\] LMDeploy supports 4-bit quantization using the [AWQ](https://arxiv.org/abs/2306.00978) algorithm.
- \[2023/07\] TurboMind supports Llama-2 70B with GQA.
- \[2023/07\] TurboMind supports Llama-2 7B/13B.
- \[2023/07\] TurboMind supports tensor-parallel inference of InternLM.
</details>

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams. It has the following core features:
- **Efficient Inference Engine (TurboMind)**: Based on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), we have implemented an efficient inference engine - TurboMind, which supports the inference of LLaMA and its variant models on NVIDIA GPUs.
- **Efficient Inference**: LMDeploy delivers up to 1.8x higher request throughput than vLLM, by introducing key features like persistent batch (a.k.a. continuous batching), blocked KV cache, dynamic split & fuse, tensor parallelism, high-performance CUDA kernels and so on.
- **Interactive Inference Mode**: By caching the k/v of attention during multi-round dialogue processes, the engine remembers dialogue history, thus avoiding repetitive processing of historical sessions.
- **Multi-GPU Model Deployment and Quantization**: We provide comprehensive model deployment and quantization support, validated at different scales.
- **Persistent Batch Inference**: Further optimization of model execution efficiency.
- **Effective Quantization**: LMDeploy supports weight-only and k/v quantization, and the 4-bit inference performance is 2.4x higher than FP16 (a short sketch follows this list). The quantization quality has been confirmed via OpenCompass evaluation.
- **Effortless Distribution Server**: Leveraging the request distribution service, LMDeploy facilitates an easy and efficient deployment of multi-model services across multiple machines and cards.

`LMDeploy` has two inference backends, `PyTorch` and `TurboMind`. You can run `lmdeploy list` to check the supported model names.

> **Note**<br />
> W4A16 inference requires NVIDIA GPUs with Ampere architecture or above.
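To make the 4-bit path above concrete, here is a minimal sketch of W4A16 inference with the `pipeline` API. The repository name below is a placeholder for any AWQ-quantized 4-bit checkpoint, and `model_format='awq'` is how TurboMind is told the weights are AWQ-quantized; follow the [W4A16 guide](./docs/en/quantization/w4a16.md) for the exact workflow.

```python
# Hedged sketch of 4-bit (W4A16) inference. The model repo is a placeholder
# for any AWQ-quantized 4-bit checkpoint, e.g. one of the ready-to-use models
# published on the HuggingFace Hub.
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline(
    "lmdeploy/llama2-chat-7b-w4",  # hypothetical 4-bit repo name
    backend_config=TurbomindEngineConfig(model_format="awq"),
)
print(pipe(["Summarize W4A16 quantization in one sentence."]))
```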
# Performance

**Case I**: output token throughput with fixed input and output token numbers (1, 2048)

**Case II**: request throughput with real conversation data

Test Setting: LLaMA-7B, NVIDIA A100 (80G)

The output token throughput of TurboMind exceeds 2000 tokens/s, which is about 5% - 15% higher than DeepSpeed overall, and it outperforms HuggingFace Transformers by up to 2.3x. The request throughput of TurboMind is 30% higher than vLLM.

For detailed inference benchmarks on more devices and with more settings, please refer to the following links:

- [A100](./docs/en/benchmark/a100_fp16.md)
- V100
- 4090
- 3090
- 2080

# Supported Models
| Model | Size |
| :----------------: | :--------: |
| Llama | 7B - 65B |
| Llama2 | 7B - 70B |
| InternLM | 7B - 20B |
| InternLM2 | 7B - 20B |
| InternLM-XComposer | 7B |
| QWen | 7B - 72B |
| QWen1.5 | 0.5B - 72B |
| QWen-VL | 7B |
| Baichuan | 7B - 13B |
| Baichuan2 | 7B - 13B |
| Code Llama | 7B - 34B |
| ChatGLM2 | 6B |
| Falcon | 7B - 180B |
| YI | 6B - 34B |
| Mistral | 7B |
| DeepSeek-MoE | 16B |
| Mixtral | 8x7B |
| Gemma              | 2B - 7B    |
LMDeploy has developed two inference engines - [TurboMind](./docs/en/inference/turbomind.md) and [PyTorch](./docs/en/inference/pytorch.md), each with a different focus. The former strives for ultimate optimization of inference performance, while the latter, developed purely in Python, aims to lower the barrier for developers.

They differ in the types of supported models and the supported inference data types. Please refer to [this table](./docs/en/supported_models/supported_models.md) for each engine's capabilities and choose the one that best fits your actual needs.
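To make that choice concrete, the following is a minimal sketch of offline inference through the high-level `pipeline` API, assuming a recent lmdeploy release (v0.2+); the model name and `tp` value are illustrative only, and either engine config can be swapped in.

```python
# Minimal sketch: the same pipeline API works with either engine; only the
# backend_config changes. The model name and tp value are example assumptions.
from lmdeploy import pipeline, TurbomindEngineConfig, PytorchEngineConfig

backend = TurbomindEngineConfig(tp=1)    # highly optimized TurboMind engine
# backend = PytorchEngineConfig(tp=1)    # or the pure-Python PyTorch engine

pipe = pipeline("internlm/internlm2-chat-7b", backend_config=backend)
print(pipe(["Hi, please introduce yourself."]))
```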
# Quick Start

## Installation
Install lmdeploy with pip (python 3.8+) or [from source](./docs/en/build.md)
```shell
pip install lmdeploy
```
The default prebuilt package is compiled on CUDA 11.8. However, if CUDA 12+ is required, you can install lmdeploy by building it [from source](./docs/en/build.md).

> **Note**<br />
> `pip install lmdeploy` can only install the runtime required packages. If users want to run codes from modules like `lmdeploy.lite` and `lmdeploy.serve`, they need to install the extra required packages.
> For instance, running `pip install lmdeploy[lite]` would install extra dependencies for `lmdeploy.lite` module.
>
> - `all`: Install lmdeploy with all dependencies in `requirements.txt`
> - `lite`: Install lmdeploy with extra dependencies in `requirements/lite.txt`
> - `serve`: Install lmdeploy with dependencies in `requirements/serve.txt`
## Deploy InternLM
To use the TurboMind inference engine, the model needs to be converted into TurboMind format first. Currently, we support online conversion and offline conversion. With online conversion, TurboMind can load the Hugging Face model directly, while with offline conversion you need to save the converted model before using it.

The following uses [internlm/internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) as an example to show how to use TurboMind with online conversion; a minimal sketch follows. You can refer to [load_hf.md](docs/en/load_hf.md) for other methods.
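The sketch below shows what online conversion looks like in practice: passing the Hugging Face model id straight to `pipeline` lets TurboMind fetch and convert the weights on the fly, so no separate conversion step is needed. The sampling parameters are illustrative assumptions.

```python
# Online conversion: TurboMind loads and converts the HF weights on the fly.
from lmdeploy import pipeline, GenerationConfig

pipe = pipeline("internlm/internlm-chat-7b")
gen_config = GenerationConfig(max_new_tokens=256, top_p=0.95, temperature=0.8)
responses = pipe(["Please introduce InternLM in two sentences."],
                 gen_config=gen_config)
print(responses[0].text)
```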
For serving with the RESTful API, refer to [restful_api.md](docs/en/restful_api.md) for more details.
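As a client-side illustration only, the sketch below assumes an OpenAI-compatible LMDeploy api_server is already running; the address, port `23333`, and served model name are assumptions for this example, so adjust them to your deployment.

```python
# Hypothetical client example; it assumes an LMDeploy api_server is already
# listening at http://0.0.0.0:23333 and serving the model named below.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:23333/v1", api_key="none")
resp = client.chat.completions.create(
    model="internlm-chat-7b",  # assumed model name exposed by the server
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```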
> \[!NOTE\]
> By default, LMDeploy downloads models from HuggingFace. If you would like to use models from ModelScope, please install ModelScope by `pip install modelscope` and set the environment variable:
>
> `export LMDEPLOY_USE_MODELSCOPE=True`

## Inference with PyTorch

For detailed instructions on inference with PyTorch models, see [here](docs/en/pytorch.md).
👋 join us on <a href="https://twitter.com/intern_lm" target="_blank">Twitter</a>, <a href="https://discord.gg/xa29JuW87d" target="_blank">Discord</a> and <a href="https://r.vansin.top/?r=internwx" target="_blank">WeChat</a>