 
[![docs](https://img.shields.io/badge/docs-latest-blue)](https://lmdeploy.readthedocs.io/en/latest/) [![codecov](https://codecov.io/gh/open-mmlab/lmdeploy/branch/main/graph/badge.svg)](https://codecov.io/gh/open-mmlab/lmdeploy) [![license](https://img.shields.io/github/license/open-mmlab/lmdeploy.svg)](https://github.com/open-mmlab/lmdeploy/tree/main/LICENSE) [![issue resolution](https://img.shields.io/github/issues-closed-raw/open-mmlab/lmdeploy)](https://github.com/open-mmlab/lmdeploy/issues) [![open issues](https://img.shields.io/github/issues-raw/open-mmlab/lmdeploy)](https://github.com/open-mmlab/lmdeploy/issues)

English | [简体中文](README_zh-CN.md)
## Introduction

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams. It has the following core features:

- A high-throughput inference engine named **TurboMind**, based on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), for LLaMA-family models
- Interactive generation. LMDeploy remembers the dialogue history by caching the attention k/v of multi-turn conversations, so it avoids repeatedly decoding the historical context on every turn.
- Persistent-batch inference (TODO: add a gif illustrating persistent batching)

## Quick Start

### Installation

Below are quick steps for installation:

```shell
conda create -n open-mmlab python=3.8
conda activate open-mmlab
git clone https://github.com/open-mmlab/lmdeploy.git
cd lmdeploy
pip install -e .
```

### Build

Pull the docker image `openmmlab/lmdeploy:latest` and build the lmdeploy libs inside its launched container:

```shell
mkdir build && cd build
../generate.sh
make -j$(nproc) && make install
```

### Serving [LLaMA](https://github.com/facebookresearch/llama)

Weights for the LLaMA models can be obtained by filling out [this form](https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform?usp=send_form).

Run one of the following commands to serve a LLaMA model on an NVIDIA GPU server:
**7B**

```shell
python3 lmdeploy/serve/turbomind/deploy.py llama-7B /path/to/llama-7b llama \
    --tokenizer_path /path/to/tokenizer/model
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
**13B**

```shell
python3 lmdeploy/serve/turbomind/deploy.py llama-13B /path/to/llama-13b llama \
    --tokenizer_path /path/to/tokenizer/model --tp 2
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
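In both cases, `deploy.py` converts the LLaMA checkpoint into a `workspace` directory in TurboMind's format, and `workspace/service_docker_up.sh` then brings up the serving side, pointing `--lib-dir` at the libs built in the Build step; `--tp` is the tensor-parallel degree, i.e. the number of GPUs the weights are sharded across. As a hedged sketch of how the same two-step recipe would extend to a larger checkpoint (the 65B model name, paths, and GPU count below are assumptions made by analogy, not commands taken from this README):

```shell
# Hedged sketch only: "llama-65B", the paths, and --tp 8 are assumed by analogy
# with the 7B/13B commands above; adjust them to your checkpoint and GPU count.
python3 lmdeploy/serve/turbomind/deploy.py llama-65B /path/to/llama-65b llama \
    --tokenizer_path /path/to/tokenizer/model --tp 8
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```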
### Serving [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/)
Vicuna checkpoints are distributed as delta weights on top of LLaMA, so they are first merged with `fastchat.model.apply_delta` and then deployed in Hugging Face (`hf`) format.

**7B**

```shell
python3 -m pip install fschat
python3 -m fastchat.model.apply_delta \
  --base-model-path /path/to/llama-7b \
  --target-model-path /path/to/vicuna-7b \
  --delta-path lmsys/vicuna-7b-delta-v1.1

python3 lmdeploy/serve/turbomind/deploy.py vicuna-7B /path/to/vicuna-7b hf
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
**13B**

```shell
python3 -m pip install fschat
python3 -m fastchat.model.apply_delta \
  --base-model-path /path/to/llama-13b \
  --target-model-path /path/to/vicuna-13b \
  --delta-path lmsys/vicuna-13b-delta-v1.1

python3 lmdeploy/serve/turbomind/deploy.py vicuna-13B /path/to/vicuna-13b hf
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
## Inference with Command Line Interface

```shell
python3 lmdeploy/serve/client.py {server_ip_address}:33337
```

## Inference with Web UI

```shell
python3 lmdeploy/app.py {server_ip_address}:33337 {model_name}
```

## User Guide

## Quantization

In fp16 mode, kv_cache int8 quantization can be enabled so that a single card can serve more users. First run the quantization script; the resulting quantization parameters are stored in the weight directory produced by `deploy.py`. Then adjust `config.ini` (a hedged `sed` sketch of these edits is given at the end of this README):

- set `use_context_fmha` to 0, i.e. turn context FMHA off
- set `quant_policy` to 4; it defaults to 0, which means kv_cache quantization is disabled

## Contributing

We appreciate all contributions to LMDeploy. Please refer to [CONTRIBUTING.md](.github/CONTRIBUTING.md) for the contributing guidelines.

## Acknowledgement

- [FasterTransformer](https://github.com/NVIDIA/FasterTransformer)

## License

This project is released under the [Apache 2.0 license](LICENSE).
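As referenced in the Quantization section above, here is a minimal sketch of the `config.ini` edits. It is not taken from this README: the file location (the weight directory produced by `deploy.py`, shown here as a hypothetical `workspace/triton_models/weights/config.ini`) and the exact key formatting are assumptions; adjust them to your layout.

```shell
# Hedged sketch: the config.ini path and key spelling below are assumed,
# not confirmed by this README.
CONFIG=workspace/triton_models/weights/config.ini
sed -i 's/^use_context_fmha.*/use_context_fmha = 0/' "$CONFIG"  # turn context FMHA off
sed -i 's/^quant_policy.*/quant_policy = 4/' "$CONFIG"          # enable kv_cache int8 (default 0 = off)
```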