Unverified Commit 46f4738c authored by lvhan028, committed by GitHub

rename llmdeploy to lmdeploy (#30)

* change llmdeploy to lmdeploy

* update logo

* update readme
parent 081a6e89
@@ -50,4 +50,4 @@ repos:
    rev: v0.2.0
    hooks:
      - id: check-copyright
        args: ["lmdeploy"]
<div align="center">
<img src="resources/lmdeploy-logo.png" width="450"/>
<div>&nbsp;</div>
<div align="center">
<b><font size="5">OpenMMLab website</font></b>
@@ -18,11 +18,11 @@
</div>
<div>&nbsp;</div>

[![docs](https://img.shields.io/badge/docs-latest-blue)](https://lmdeploy.readthedocs.io/en/latest/)
[![codecov](https://codecov.io/gh/open-mmlab/lmdeploy/branch/main/graph/badge.svg)](https://codecov.io/gh/open-mmlab/lmdeploy)
[![license](https://img.shields.io/github/license/open-mmlab/lmdeploy.svg)](https://github.com/open-mmlab/mmdeploy/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/open-mmlab/lmdeploy)](https://github.com/open-mmlab/lmdeploy/issues)
[![open issues](https://img.shields.io/github/issues-raw/open-mmlab/lmdeploy)](https://github.com/open-mmlab/lmdeploy/issues)

English | [简体中文](README_zh-CN.md)
@@ -30,9 +30,9 @@ English | [简体中文](README_zh-CN.md)
<div align="center">
<a href="https://openmmlab.medium.com/" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/219255827-67c1a27f-f8c5-46a9-811d-5e57448c61d1.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://discord.com/channels/1037617289144569886/1046608014234370059" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/218347213-c080267f-cbb6-443e-8532-8e1ed9a58ea9.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://twitter.com/OpenMMLab" style="text-decoration:none;">
@@ -40,27 +40,47 @@ English | [简体中文](README_zh-CN.md)
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://www.youtube.com/openmmlab" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/218346691-ceb2116a-465a-40af-8424-9f30d2348ca9.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://space.bilibili.com/1293512903" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/219026751-d7d14cce-a7c9-4e82-9942-8375fca65b99.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://www.zhihu.com/people/openmmlab" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/219026120-ba71e48b-6e94-4bd4-b4e9-b7d175b5e362.png" width="3%" alt="" /></a>
</div>

## Introduction
LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams. It has the following core features:
- A high-throughput inference engine named **TurboMind**, based on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), for LLaMA-family models
- Interactive generation: by caching the attention k/v of multi-turn dialogues, LMDeploy remembers the conversation history and avoids re-decoding past turns (see the sketch after this list)
<div align="center">
<img src="https://github.com/NVIDIA/FasterTransformer/blob/main/docs/images/gpt/gpt_interactive_generation.2.png?raw=true" width="600"/>
</div>
- Support for persistent-batch inference
TODO: gif to show what persistent batch is
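The snippet below is a minimal, hypothetical sketch of what multi-turn interactive generation could look like from Python. The import path and the `Chatbot` constructor arguments follow the code in this repository, but the `chat()` call and its parameters are illustrative assumptions, not the actual API.

```python
# Hypothetical sketch: `chat()` is an illustrative stand-in for the real
# generation entry point; the constructor arguments mirror lmdeploy/app.py.
from lmdeploy.serve.fastertransformer.chatbot import Chatbot

chatbot = Chatbot('localhost:33337',  # {server_ip_address}:33337 from the serving step
                  'vicuna',           # a model name registered in lmdeploy.model
                  log_level='INFO',
                  display=True)

session_id = 1
# Turn 1: the prompt is decoded once and its attention k/v stays cached server side.
chatbot.chat(session_id, 'Name three classic papers on language models.')
# Turn 2: only the new prompt is decoded; the cached k/v from turn 1 is reused
# instead of re-decoding the whole history.
chatbot.chat(session_id, 'Summarize them in one sentence each.')
```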
## Quick Start
### Installation
Below are quick steps for installation:
```shell
conda create -n open-mmlab python=3.8
conda activate open-mmlab
git clone https://github.com/open-mmlab/lmdeploy.git
cd lmdeploy
pip install -e .
```
### Build

Pull the docker image `openmmlab/lmdeploy:latest` and build the lmdeploy libs inside the launched container:
```shell
mkdir build && cd build
../generate.sh
make -j$(nproc) && make install
```
@@ -78,7 +98,7 @@ Run one of the following commands to serve a LLaMA model on NVIDIA GPU server:
<summary><b>7B</b></summary>

```shell
python3 lmdeploy/serve/fastertransformer/deploy.py llama-7B /path/to/llama-7b llama \
    --tokenizer_path /path/to/tokenizer/model
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
@@ -89,35 +109,13 @@ bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
<summary><b>13B</b></summary>

```shell
python3 lmdeploy/serve/fastertransformer/deploy.py llama-13B /path/to/llama-13b llama \
    --tokenizer_path /path/to/tokenizer/model --tp 2
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```

</details>
<details open>
<summary><b>33B</b></summary>
```shell
python3 llmdeploy/serve/fastertransformer/deploy.py llama-33B /path/to/llama-33b llama \
--tokenizer_path /path/to/tokenizer/model --tp 4
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
</details>
<details open>
<summary><b>65B</b></summary>
```shell
python3 llmdeploy/serve/fastertransformer/deploy.py llama-65B /path/to/llama-65b llama \
--tokenizer_path /path/to/tokenizer/model --tp 8
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
</details>
### Serving [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/)

<details open>
@@ -130,7 +128,7 @@ python3 -m fastchat.model.apply_delta \
    --target-model-path /path/to/vicuna-7b \
    --delta-path lmsys/vicuna-7b-delta-v1.1

python3 lmdeploy/serve/fastertransformer/deploy.py vicuna-7B /path/to/vicuna-7b hf
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
@@ -146,7 +144,7 @@ python3 -m fastchat.model.apply_delta \
    --target-model-path /path/to/vicuna-13b \
    --delta-path lmsys/vicuna-13b-delta-v1.1

python3 lmdeploy/serve/fastertransformer/deploy.py vicuna-13B /path/to/vicuna-13b hf
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
@@ -155,28 +153,29 @@ bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
## Inference with Command Line Interface

```shell
python3 lmdeploy/serve/client.py {server_ip_address}:33337
```
## Inference with Web UI

```shell
python3 lmdeploy/app.py {server_ip_address}:33337 {model_name}
```
## User Guide

## Quantization

In fp16 mode, kv_cache int8 quantization can be enabled so that a single GPU can serve more users.

First, run the quantization script; the quantization parameters are stored in the weight directory generated by `deploy.py`.

Then adjust `config.ini`:
- `use_context_fmha` changed to 0, meaning it is turned off
- `quant_policy` set to 4; this parameter defaults to 0, which means quantization is not enabled
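For illustration, the relevant lines of `config.ini` would then read as in the snippet below; only the two keys and values above come from this guide, and the file's exact location inside the directory produced by `deploy.py` is an assumption.

```ini
# Hypothetical excerpt of config.ini after enabling kv_cache int8 quantization;
# only use_context_fmha and quant_policy are prescribed by this guide.
use_context_fmha = 0
quant_policy = 4
```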
## Contributing

We appreciate all contributions to LMDeploy. Please refer to [CONTRIBUTING.md](.github/CONTRIBUTING.md) for the contributing guidelines.
## Acknowledgement
<div align="center">
<img src="resources/lmdeploy-logo.png" width="450"/>
<div>&nbsp;</div>
<div align="center">
<b><font size="5">OpenMMLab website</font></b>
@@ -18,11 +18,11 @@
</div>
<div>&nbsp;</div>

[![docs](https://img.shields.io/badge/docs-latest-blue)](https://lmdeploy.readthedocs.io/en/latest/)
[![codecov](https://codecov.io/gh/open-mmlab/lmdeploy/branch/main/graph/badge.svg)](https://codecov.io/gh/open-mmlab/lmdeploy)
[![license](https://img.shields.io/github/license/open-mmlab/lmdeploy.svg)](https://github.com/open-mmlab/mmdeploy/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/open-mmlab/lmdeploy)](https://github.com/open-mmlab/lmdeploy/issues)
[![open issues](https://img.shields.io/github/issues-raw/open-mmlab/lmdeploy)](https://github.com/open-mmlab/lmdeploy/issues)

[English](README.md) | 简体中文
@@ -30,9 +30,9 @@
<div align="center">
<a href="https://openmmlab.medium.com/" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/219255827-67c1a27f-f8c5-46a9-811d-5e57448c61d1.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://discord.com/channels/1037617289144569886/1046608014234370059" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/218347213-c080267f-cbb6-443e-8532-8e1ed9a58ea9.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://twitter.com/OpenMMLab" style="text-decoration:none;">
@@ -40,33 +40,63 @@
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://www.youtube.com/openmmlab" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/218346691-ceb2116a-465a-40af-8424-9f30d2348ca9.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://space.bilibili.com/1293512903" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/219026751-d7d14cce-a7c9-4e82-9942-8375fca65b99.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://www.zhihu.com/people/openmmlab" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/219026120-ba71e48b-6e94-4bd4-b4e9-b7d175b5e362.png" width="3%" alt="" /></a>
</div>

## Introduction

LMDeploy is a toolbox for LLM lightweighting, deployment, and serving, developed jointly by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams. It has the following core features:

- An efficient inference engine, **TurboMind**, built on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), supporting inference of LLaMA and its variants on NVIDIA devices
- An interactive inference mode: by caching the attention k/v of multi-turn dialogues, it remembers the conversation history and avoids re-decoding past turns
<div align="center">
<img src="https://github.com/NVIDIA/FasterTransformer/blob/main/docs/images/gpt/gpt_interactive_generation.2.png?raw=true" width="600"/>
</div>
- Support for persistent-batch inference
TODO: gif to show what persistent batch is
## Quick Start

### Installation
```shell
conda create -n open-mmlab python=3.8
conda activate open-mmlab
git clone https://github.com/open-mmlab/lmdeploy.git
cd lmdeploy
pip install -e .
```
### Build

Pull the docker image `openmmlab/lmdeploy:latest`, mount the lmdeploy source as a volume, launch a container, and run the following commands inside it (a sketch of the container launch follows the block):
```shell
mkdir build && cd build
../generate.sh
make -j$(nproc) && make install
```
### Serving [LLaMA](https://github.com/facebookresearch/llama)

Please fill out [this form](https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform) to obtain the LLaMA model weights.

Run one of the following commands to deploy the LLaMA model on an NVIDIA GPU server:
<details open>
<summary><b>7B</b></summary>

```shell
python3 lmdeploy/serve/fastertransformer/deploy.py llama-7B /path/to/llama-7b llama \
    --tokenizer_path /path/to/tokenizer/model
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
@@ -77,35 +107,13 @@ bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
<summary><b>13B</b></summary>

```shell
python3 lmdeploy/serve/fastertransformer/deploy.py llama-13B /path/to/llama-13b llama \
    --tokenizer_path /path/to/tokenizer/model --tp 2
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```

</details>
<details open>
<summary><b>33B</b></summary>
```shell
python3 llmdeploy/serve/fastertransformer/deploy.py llama-33B /path/to/llama-33b llama \
--tokenizer_path /path/to/tokenizer/model --tp 4
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
</details>
<details open>
<summary><b>65B</b></summary>
```shell
python3 llmdeploy/serve/fastertransformer/deploy.py llama-65B /path/to/llama-65b llama \
--tokenizer_path /path/to/tokenizer/model --tp 8
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
</details>
### Serving [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/)

<details open>
@@ -118,7 +126,7 @@ python3 -m fastchat.model.apply_delta \
    --target-model-path /path/to/vicuna-7b \
    --delta-path lmsys/vicuna-7b-delta-v1.1

python3 lmdeploy/serve/fastertransformer/deploy.py vicuna-7B /path/to/vicuna-7b hf
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
@@ -134,7 +142,7 @@ python3 -m fastchat.model.apply_delta \
    --target-model-path /path/to/vicuna-13b \
    --delta-path lmsys/vicuna-13b-delta-v1.1

python3 lmdeploy/serve/fastertransformer/deploy.py vicuna-13B /path/to/vicuna-13b hf
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
@@ -143,24 +151,27 @@ bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
## Inference with Command Line Interface

```shell
python3 lmdeploy/serve/client.py {server_ip_address}:33337
```
## Inference with Web UI

```shell
python3 lmdeploy/app.py {server_ip_address}:33337 {model_name}
```
## Quantization

In fp16 mode, kv_cache int8 quantization can be enabled so that a single GPU can serve more users.

First, run the quantization script; the quantization parameters are stored in the weight directory generated by `deploy.py`.

Then adjust `config.ini`:

- `use_context_fmha` changed to 0, meaning it is turned off
- `quant_policy` set to 4; this parameter defaults to 0, which means quantization is not enabled
## Contributing

We appreciate every contributor's effort to improve LMDeploy. Please refer to the [contributing guidelines](.github/CONTRIBUTING.md) to learn how to participate.

## Acknowledgement
@@ -4,7 +4,7 @@ import time
import fire
import numpy as np

from lmdeploy.serve.fastertransformer.chatbot import Chatbot


def infer(chatbot, session_id: int, prompt: str, output_seqlen: int,
@@ -9,7 +9,7 @@ import fire
import numpy as np
from sentencepiece import SentencePieceProcessor

from lmdeploy.serve.fastertransformer.chatbot import Chatbot


class Tokenizer:
# Copyright (c) OpenMMLab. All rights reserved.
import os
import threading
from functools import partial
from typing import Sequence

import fire
import gradio as gr

from lmdeploy.serve.fastertransformer.chatbot import Chatbot

CSS = """
#container {
@@ -29,7 +29,7 @@ CSS = """
THEME = gr.themes.Soft(
    primary_hue=gr.themes.colors.blue,
    secondary_hue=gr.themes.colors.sky,
    font=[gr.themes.GoogleFont('Inconsolata'), 'Arial', 'sans-serif'])


def chat_stream(instruction: str,
@@ -64,8 +64,10 @@ def reset_all_func(instruction_txtbox: gr.Textbox, state_chatbot: gr.State,
    state_chatbot = []
    log_level = os.environ.get('SERVICE_LOG_LEVEL', 'INFO')
    llama_chatbot = Chatbot(triton_server_addr,
                            model_name,
                            log_level=log_level,
                            display=True)

    return (
        llama_chatbot,
@@ -95,21 +97,19 @@ def run(triton_server_addr: str,
        server_port: int = 6006):
    with gr.Blocks(css=CSS, theme=THEME) as demo:
        chat_interface = partial(chat_stream, model_name=model_name)
        reset_all = partial(reset_all_func,
                            model_name=model_name,
                            triton_server_addr=triton_server_addr)
        log_level = os.environ.get('SERVICE_LOG_LEVEL', 'INFO')
        llama_chatbot = gr.State(
            Chatbot(triton_server_addr,
                    model_name,
                    log_level=log_level,
                    display=True))
        state_chatbot = gr.State([])

        with gr.Column(elem_id='container'):
            gr.Markdown('## LMDeploy Playground')
            chatbot = gr.Chatbot(elem_id='chatbot', label=model_name)
            instruction_txtbox = gr.Textbox(
@@ -132,8 +132,8 @@ def run(triton_server_addr: str,
            [instruction_txtbox],
        )

        cancel_btn.click(cancel_func,
                         [instruction_txtbox, state_chatbot, llama_chatbot],
                         [llama_chatbot, chatbot],
                         cancels=[send_event])
@@ -142,8 +142,7 @@ def run(triton_server_addr: str,
            [llama_chatbot, state_chatbot, chatbot, instruction_txtbox],
            cancels=[send_event])

    demo.queue(concurrency_count=4, max_size=100, api_open=True).launch(
        max_threads=10,
        share=True,
        server_port=server_port,
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine import Registry

MODELS = Registry('model', locations=['lmdeploy.model'])


@MODELS.register_module(name='vicuna')
@@ -3,7 +3,7 @@ import os
import fire

from lmdeploy.serve.fastertransformer.chatbot import Chatbot


def input_prompt():
# Copyright (c) OpenMMLab. All rights reserved.
from lmdeploy.serve.fastertransformer.chatbot import Chatbot  # noqa: F401,F403
@@ -15,8 +15,8 @@ import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.grpc.service_pb2 import ModelInferResponse

from lmdeploy.model import MODELS
from lmdeploy.serve.fastertransformer.utils import (Postprocessor,
                                                    Preprocessor,
                                                    prepare_tensor)
@@ -107,8 +107,7 @@ class Chatbot:
        stop_words = None
        bad_words = np.array([[[self.eos_id], [1]]], dtype=np.int32)
        self.cfg = mmengine.Config(
            dict(session_len=session_len,
                 top_p=top_p,
                 top_k=top_k,
                 temperature=temperature,
@@ -203,8 +202,7 @@ class Chatbot:
            return StatusCode.TRITON_SESSION_CLOSED

        self._session.status = 0
        for status, _, _ in self._stream_infer(self._session,
                                               prompt='',
                                               request_output_len=0,
                                               sequence_start=False,
@@ -244,8 +242,7 @@ class Chatbot:
            return StatusCode.TRITON_SESSION_CLOSED

        prev_session = self._session
        for status, res, _ in self._stream_infer(self._session,
                                                 prompt='',
                                                 request_output_len=0,
                                                 sequence_start=False,
@@ -346,10 +343,10 @@ class Chatbot:
        session.response = ''

        que = queue.Queue()
        producer = threading.Thread(target=self._stream_producer,
                                    args=(self.tritonserver_addr, session, que,
                                          self.cfg, input_ids, input_lengths,
                                          request_output_len, sequence_start,
                                          sequence_end, preseq_length, cancel))
        producer.start()
        for state, res, tokens in self.stream_consumer(
@@ -421,8 +418,7 @@ class Chatbot:
            random_seed * np.ones((1, 1), dtype=np.uint64))
        ]

        client.start_stream(callback)
        client.async_stream_infer('fastertransformer',
                                  inputs,
                                  sequence_id=session.session_id,
                                  request_id=session.request_id,
@@ -127,8 +127,7 @@ def export(model_name: str,
    vocab_size, bos_id, eos_id = tokenizer_info(tokenizer_path)
    assert _vocab_size == vocab_size, \
        f'different vocab size {_vocab_size} vs {vocab_size}'
    cfg = dict(llama=dict(
        model_name=model_name,
        head_num=head_num,
        size_per_head=size_per_head,
@@ -191,8 +190,9 @@ def deploy_llama(model_name: str, model_path: str, tokenizer_path: str,
    def get_param(_name, _size):
        print(_name, _size)
        if _name not in model_params:
            model_params[_name] = torch.zeros(_size,
                                              dtype=torch.float16,
                                              device='cpu')
        return model_params[_name]

    for i, ckpt_path in enumerate(checkpoints):
@@ -387,14 +387,11 @@ def deploy_hf(model_name: str, model_path: str, tokenizer_path: str,
def pack_model_repository(workspace_path: str):
    model_repo_dir = osp.join(workspace_path, 'model_repository')
    os.makedirs(model_repo_dir, exist_ok=True)
    os.symlink(src=osp.join('../triton_models/interactive'),
               dst=osp.join(model_repo_dir, 'fastertransformer'))
    os.symlink(src=osp.join('../triton_models/preprocessing'),
               dst=osp.join(model_repo_dir, 'preprocessing'))
    os.symlink(src=osp.join('../triton_models/postprocessing'),
               dst=osp.join(model_repo_dir, 'postprocessing'))
@@ -41,8 +41,8 @@ if [ -z "$1" ]; then
        --cap-add=SYS_PTRACE \
        --cap-add=SYS_ADMIN \
        --security-opt seccomp=unconfined \
        --name lmdeploy \
        -it --env NCCL_LAUNCH_MODE=GROUP openmmlab/lmdeploy:latest \
        tritonserver \
        --model-repository=/workspace/models/model_repository \
        --allow-http=0 \
@@ -72,8 +72,8 @@ for ((i = 1; i <= $#; i++)); do
        --cap-add=SYS_PTRACE \
        --cap-add=SYS_ADMIN \
        --security-opt seccomp=unconfined \
        --name lmdeploy \
        -it --env NCCL_LAUNCH_MODE=GROUP openmmlab/lmdeploy:latest \
        tritonserver \
        --model-repository=/workspace/models/model_repository \
        --allow-http=0 \
@@ -61,8 +61,8 @@ class Tokenizer:
            return self.model.Decode(t)
        else:
            skip_special_tokens = False
            return self.model.decode(t,
                                     skip_special_tokens=skip_special_tokens)


class TritonPythonModel:
@@ -63,8 +63,8 @@ class Tokenizer:
            return self.model.Decode(t)
        else:
            skip_special_tokens = False
            return self.model.decode(t,
                                     skip_special_tokens=skip_special_tokens)


class TritonPythonModel:
@@ -190,6 +190,7 @@ class TritonPythonModel:
            for s in query
        ]
        start_lengths = torch.IntTensor([[len(ids)] for ids in start_ids])
        start_ids = pad_sequence(start_ids,
                                 batch_first=True,
                                 padding_value=self.end_id)
        return start_ids, start_lengths