# LLMDeploy
 
[![docs](https://img.shields.io/badge/docs-latest-blue)](https://llmdeploy.readthedocs.io/en/latest/) [![codecov](https://codecov.io/gh/open-mmlab/llmdeploy/branch/main/graph/badge.svg)](https://codecov.io/gh/open-mmlab/llmdeploy) [![license](https://img.shields.io/github/license/open-mmlab/llmdeploy.svg)](https://github.com/open-mmlab/llmdeploy/tree/main/LICENSE) [![issue resolution](https://img.shields.io/github/issues-closed-raw/open-mmlab/llmdeploy)](https://github.com/open-mmlab/llmdeploy/issues) [![open issues](https://img.shields.io/github/issues-raw/open-mmlab/llmdeploy)](https://github.com/open-mmlab/llmdeploy/issues)

English | [简体中文](README_zh-CN.md)
## Introduction

## Installation

Below are quick steps for installation:

```shell
conda create -n open-mmlab python=3.8
conda activate open-mmlab
git clone https://github.com/open-mmlab/llmdeploy.git
cd llmdeploy
pip install -e .
```

## Quick Start

### Build

Pull the docker image `openmmlab/llmdeploy:base` and build the llmdeploy libs in its launched container:

```shell
mkdir build && cd build
../generate.sh
make -j$(nproc) && make install
```
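The build commands above are intended to be run inside a container started from the `openmmlab/llmdeploy:base` image. One way to launch such a container is sketched below; the GPU flag, mount path, and working directory are illustrative assumptions, so adjust them to your checkout:

```shell
# Launch the build container with GPU access and the llmdeploy checkout
# mounted at /workspace (the mount point is an assumption, not a requirement).
docker run --gpus all -it --rm \
    -v $(pwd):/workspace -w /workspace \
    openmmlab/llmdeploy:base
```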
### Serving [LLaMA](https://github.com/facebookresearch/llama)

Weights for the LLaMA models can be obtained by filling out [this form](https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform?usp=send_form).

Run one of the following commands to serve a LLaMA model on an NVIDIA GPU server:

**7B**

```shell
python3 llmdeploy/serve/fastertransformer/deploy.py llama-7B /path/to/llama-7b llama \
    --tokenizer_path /path/to/tokenizer/model
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
**13B**

```shell
python3 llmdeploy/serve/fastertransformer/deploy.py llama-13B /path/to/llama-13b llama \
    --tokenizer_path /path/to/tokenizer/model --tp 2
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
**33B**

```shell
python3 llmdeploy/serve/fastertransformer/deploy.py llama-33B /path/to/llama-33b llama \
    --tokenizer_path /path/to/tokenizer/model --tp 4
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
**65B**

```shell
python3 llmdeploy/serve/fastertransformer/deploy.py llama-65B /path/to/llama-65b llama \
    --tokenizer_path /path/to/tokenizer/model --tp 8
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
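The `--tp` value in the commands above appears to set the tensor-parallel degree, i.e. how many GPUs the model weights are split across, so it should not exceed the number of GPUs on the serving machine. A quick, generic way to check the GPU count (plain `nvidia-smi`, nothing llmdeploy-specific):

```shell
# Count the visible NVIDIA GPUs; --tp should be no larger than this number.
nvidia-smi --list-gpus | wc -l
```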
### Serving [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/)
**7B**

```shell
python3 -m pip install fschat
python3 -m fastchat.model.apply_delta \
    --base-model-path /path/to/llama-7b \
    --target-model-path /path/to/vicuna-7b \
    --delta-path lmsys/vicuna-7b-delta-v1.1

python3 llmdeploy/serve/fastertransformer/deploy.py vicuna-7B /path/to/vicuna-7b hf
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
**13B**

```shell
python3 -m pip install fschat
python3 -m fastchat.model.apply_delta \
    --base-model-path /path/to/llama-13b \
    --target-model-path /path/to/vicuna-13b \
    --delta-path lmsys/vicuna-13b-delta-v1.1

python3 llmdeploy/serve/fastertransformer/deploy.py vicuna-13B /path/to/vicuna-13b hf
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
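Before converting the merged weights with `deploy.py`, you can optionally check that they load as a regular Hugging Face model. This is a minimal sketch, assuming the `transformers` package is available in the environment (FastChat's delta-merge step already relies on it) and using the same placeholder path as above:

```shell
# Optional sanity check: the merged Vicuna weights should be a standard
# Hugging Face model directory ("/path/to/vicuna-7b" is the placeholder above).
python3 -c "from transformers import AutoConfig; print(AutoConfig.from_pretrained('/path/to/vicuna-7b'))"
```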
## Inference with Command Line Interface

```shell
python3 llmdeploy/serve/client.py {server_ip_address}:33337 1
```

## User Guide

## Contributing

We appreciate all contributions to LLMDeploy. Please refer to [CONTRIBUTING.md](.github/CONTRIBUTING.md) for the contributing guideline.

## Acknowledgement

- [FasterTransformer](https://github.com/NVIDIA/FasterTransformer)

## License

This project is released under the [Apache 2.0 license](LICENSE).