Unverified commit 46f4738c authored by lvhan028, committed by GitHub

rename llmdeploy to lmdeploy (#30)

* change llmdeploy to lmdeploy

* update logo

* update readme
parent 081a6e89
......@@ -50,4 +50,4 @@ repos:
rev: v0.2.0
hooks:
- id: check-copyright
args: ["llmdeploy"]
args: ["lmdeploy"]
<div align="center">
<img src="resources/llmdeploy-logo.png" width="450"/>
<img src="resources/lmdeploy-logo.png" width="450"/>
<div>&nbsp;</div>
<div align="center">
<b><font size="5">OpenMMLab website</font></b>
......@@ -18,11 +18,11 @@
</div>
<div>&nbsp;</div>
[![docs](https://img.shields.io/badge/docs-latest-blue)](https://llmdeploy.readthedocs.io/en/latest/)
[![codecov](https://codecov.io/gh/open-mmlab/llmdeploy/branch/main/graph/badge.svg)](https://codecov.io/gh/open-mmlab/llmdeploy)
[![license](https://img.shields.io/github/license/open-mmlab/llmdeploy.svg)](https://github.com/open-mmlab/mmdeploy/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/open-mmlab/llmdeploy)](https://github.com/open-mmlab/llmdeploy/issues)
[![open issues](https://img.shields.io/github/issues-raw/open-mmlab/llmdeploy)](https://github.com/open-mmlab/llmdeploy/issues)
[![docs](https://img.shields.io/badge/docs-latest-blue)](https://lmdeploy.readthedocs.io/en/latest/)
[![codecov](https://codecov.io/gh/open-mmlab/lmdeploy/branch/main/graph/badge.svg)](https://codecov.io/gh/open-mmlab/lmdeploy)
[![license](https://img.shields.io/github/license/open-mmlab/lmdeploy.svg)](https://github.com/open-mmlab/mmdeploy/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/open-mmlab/lmdeploy)](https://github.com/open-mmlab/lmdeploy/issues)
[![open issues](https://img.shields.io/github/issues-raw/open-mmlab/lmdeploy)](https://github.com/open-mmlab/lmdeploy/issues)
English | [简体中文](README_zh-CN.md)
......@@ -30,9 +30,9 @@ English | [简体中文](README_zh-CN.md)
<div align="center">
<a href="https://openmmlab.medium.com/" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/218352562-cdded397-b0f3-4ca1-b8dd-a60df8dca75b.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/219255827-67c1a27f-f8c5-46a9-811d-5e57448c61d1.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://discord.gg/raweFPmdzG" style="text-decoration:none;">
<a href="https://discord.com/channels/1037617289144569886/1046608014234370059" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/218347213-c080267f-cbb6-443e-8532-8e1ed9a58ea9.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://twitter.com/OpenMMLab" style="text-decoration:none;">
......@@ -40,27 +40,47 @@ English | [简体中文](README_zh-CN.md)
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://www.youtube.com/openmmlab" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/218346691-ceb2116a-465a-40af-8424-9f30d2348ca9.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://space.bilibili.com/1293512903" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/219026751-d7d14cce-a7c9-4e82-9942-8375fca65b99.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://www.zhihu.com/people/openmmlab" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/219026120-ba71e48b-6e94-4bd4-b4e9-b7d175b5e362.png" width="3%" alt="" /></a>
</div>
## Introduction
## Installation
LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams. It has the following core features:
- A high-throughput inference engine, **TurboMind**, built on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer) for LLaMA-family models
- Interactive generation: by caching the attention k/v of multi-turn dialogues, LMDeploy remembers the conversation history and avoids re-decoding it (see the client sketch after this list)
<div align="center">
<img src="https://github.com/NVIDIA/FasterTransformer/blob/main/docs/images/gpt/gpt_interactive_generation.2.png?raw=true" width="600"/>
</div>
- Persistent-batch inference
TODO: gif to show what persistent batch is
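In practice, interactive generation is exercised through the gRPC `Chatbot` client that appears throughout this diff. Below is a minimal sketch; only the constructor call is visible in this change, so the `chat()` streaming method and its arguments are assumptions for illustration — check `lmdeploy/serve/fastertransformer/chatbot.py` for the real API.

```python
# Hedged sketch of multi-turn, interactive inference against a running
# lmdeploy Triton server. The Chatbot constructor matches the call shown
# elsewhere in this diff; chat() is an ASSUMED public method for illustration.
from lmdeploy.serve.fastertransformer.chatbot import Chatbot

chatbot = Chatbot('{server_ip_address}:33337',  # tritonserver address
                  'llama-7B',                   # model name used at deploy time
                  log_level='INFO',
                  display=True)

# Because the server caches the attention k/v per session, the second turn
# does not re-decode the first exchange.
session_id = 1
for prompt in ['Hello, who are you?', 'Summarize what you just said.']:
    for output in chatbot.chat(session_id, prompt):  # hypothetical method
        print(output, end='', flush=True)
```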
## Quick Start
### Installation
Below are quick steps for installation:
```shell
conda create -n open-mmlab python=3.8
conda activate open-mmlab
git clone https://github.com/open-mmlab/llmdeploy.git
cd llmdeploy
git clone https://github.com/open-mmlab/lmdeploy.git
cd lmdeploy
pip install -e .
```
## Quick Start
### Build
Pull docker image `openmmlab/llmdeploy:base` and build llmdeploy libs in its launched container
Pull the docker image `openmmlab/lmdeploy:latest`, launch a container from it, and build the lmdeploy libraries inside the container:
```shell
mkdir build && cd build
......@@ -78,7 +98,7 @@ Run one of the following commands to serve a LLaMA model on NVIDIA GPU server:
<summary><b>7B</b></summary>
```shell
python3 llmdeploy/serve/fastertransformer/deploy.py llama-7B /path/to/llama-7b llama \
python3 lmdeploy/serve/fastertransformer/deploy.py llama-7B /path/to/llama-7b llama \
--tokenizer_path /path/to/tokenizer/model
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
......@@ -89,35 +109,13 @@ bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fast
<summary><b>13B</b></summary>
```shell
python3 llmdeploy/serve/fastertransformer/deploy.py llama-13B /path/to/llama-13b llama \
python3 lmdeploy/serve/fastertransformer/deploy.py llama-13B /path/to/llama-13b llama \
--tokenizer_path /path/to/tokenizer/model --tp 2
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
</details>
<details open>
<summary><b>33B</b></summary>
```shell
python3 llmdeploy/serve/fastertransformer/deploy.py llama-33B /path/to/llama-33b llama \
--tokenizer_path /path/to/tokenizer/model --tp 4
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
</details>
<details open>
<summary><b>65B</b></summary>
```shell
python3 llmdeploy/serve/fastertransformer/deploy.py llama-65B /path/to/llama-65b llama \
--tokenizer_path /path/to/tokenizer/model --tp 8
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
</details>
### Serving [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/)
<details open>
......@@ -130,7 +128,7 @@ python3 -m fastchat.model.apply_delta \
--target-model-path /path/to/vicuna-7b \
--delta-path lmsys/vicuna-7b-delta-v1.1
python3 llmdeploy/serve/fastertransformer/deploy.py vicuna-7B /path/to/vicuna-7b hf
python3 lmdeploy/serve/fastertransformer/deploy.py vicuna-7B /path/to/vicuna-7b hf
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
......@@ -146,7 +144,7 @@ python3 -m fastchat.model.apply_delta \
--target-model-path /path/to/vicuna-13b \
--delta-path lmsys/vicuna-13b-delta-v1.1
python3 llmdeploy/serve/fastertransformer/deploy.py vicuna-13B /path/to/vicuna-13b hf
python3 lmdeploy/serve/fastertransformer/deploy.py vicuna-13B /path/to/vicuna-13b hf
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
......@@ -155,28 +153,29 @@ bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fast
## Inference with Command Line Interface
```shell
python3 llmdeploy/serve/client.py {server_ip_addresss}:33337 1
python3 lmdeploy/serve/client.py {server_ip_address}:33337
```
## Inference with Web UI
```shell
python3 llmdeploy/app.py {server_ip_addresss}:33337 model_name
python3 lmdeploy/app.py {server_ip_address}:33337 {model_name}
```
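The Web UI can also be started from Python. This is a hedged sketch based on the `run()` definition that appears in the `app.py` diff further down; the keyword parameters are inferred from that snippet and may not match exactly.

```python
# Hedged sketch: start the Gradio playground programmatically instead of via
# `python3 lmdeploy/app.py ...`. Parameter names are inferred from the run()
# definition shown later in this diff and may differ in the actual file.
from lmdeploy.app import run

run('{server_ip_address}:33337',  # triton_server_addr
    model_name='vicuna-7B',       # assumed keyword, mirrors the CLI argument
    server_name='0.0.0.0',        # assumed keyword seen in the launch() call
    server_port=6006)             # default value visible in the diff
```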
## User Guide
## Quantization
In fp16 mode, int8 quantization of the kv_cache can be enabled so that a single GPU can serve more users.
First, run the quantization script; the quantization parameters are written to the weight directory produced by `deploy.py`.
Then adjust `config.ini` (a scripted example follows the list below):
* `use_context_fmha` changed to 0, means off
* `quant_policy` is set to 4. This parameter defaults to 0, which means it is not enabled
## Contributing
- Set `use_context_fmha` to 0, which disables context FMHA
- Set `quant_policy` to 4; it defaults to 0, which means quantization is disabled
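These two edits can be scripted with Python's standard `configparser`, which `deploy.py` itself uses to write this file. The `llama` section and both keys are taken from the export code later in this diff; the config path inside the workspace is an assumed example.

```python
# Sketch: flip the quantization-related settings in the generated config.ini.
# The 'llama' section and both keys come from deploy.py's export code in this
# diff; the workspace path below is an assumption -- adjust it to your setup.
import configparser

cfg_path = 'workspace/triton_models/weights/config.ini'  # assumed location
config = configparser.ConfigParser()
config.read(cfg_path)

config['llama']['use_context_fmha'] = '0'  # turn off context FMHA
config['llama']['quant_policy'] = '4'      # enable kv_cache int8 quantization

with open(cfg_path, 'w') as f:
    config.write(f)
```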
We appreciate all contributions to LLMDeploy. Please refer to [CONTRIBUTING.md](.github/CONTRIBUTING.md) for the contributing guideline.
## Contributing
We appreciate all contributions to LMDeploy. Please refer to [CONTRIBUTING.md](.github/CONTRIBUTING.md) for the contributing guideline.
## Acknowledgement
......
<div align="center">
<img src="resources/llmdeploy-logo.png" width="450"/>
<img src="resources/lmdeploy-logo.png" width="450"/>
<div>&nbsp;</div>
<div align="center">
<b><font size="5">OpenMMLab website</font></b>
......@@ -18,11 +18,11 @@
</div>
<div>&nbsp;</div>
[![docs](https://img.shields.io/badge/docs-latest-blue)](https://llmdeploy.readthedocs.io/en/latest/)
[![codecov](https://codecov.io/gh/open-mmlab/llmdeploy/branch/main/graph/badge.svg)](https://codecov.io/gh/open-mmlab/llmdeploy)
[![license](https://img.shields.io/github/license/open-mmlab/llmdeploy.svg)](https://github.com/open-mmlab/mmdeploy/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/open-mmlab/llmdeploy)](https://github.com/open-mmlab/llmdeploy/issues)
[![open issues](https://img.shields.io/github/issues-raw/open-mmlab/llmdeploy)](https://github.com/open-mmlab/llmdeploy/issues)
[![docs](https://img.shields.io/badge/docs-latest-blue)](https://lmdeploy.readthedocs.io/en/latest/)
[![codecov](https://codecov.io/gh/open-mmlab/lmdeploy/branch/main/graph/badge.svg)](https://codecov.io/gh/open-mmlab/lmdeploy)
[![license](https://img.shields.io/github/license/open-mmlab/lmdeploy.svg)](https://github.com/open-mmlab/mmdeploy/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/open-mmlab/lmdeploy)](https://github.com/open-mmlab/lmdeploy/issues)
[![open issues](https://img.shields.io/github/issues-raw/open-mmlab/lmdeploy)](https://github.com/open-mmlab/lmdeploy/issues)
[English](README.md) | 简体中文
......@@ -30,9 +30,9 @@
<div align="center">
<a href="https://openmmlab.medium.com/" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/218352562-cdded397-b0f3-4ca1-b8dd-a60df8dca75b.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/219255827-67c1a27f-f8c5-46a9-811d-5e57448c61d1.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://discord.gg/raweFPmdzG" style="text-decoration:none;">
<a href="https://discord.com/channels/1037617289144569886/1046608014234370059" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/218347213-c080267f-cbb6-443e-8532-8e1ed9a58ea9.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://twitter.com/OpenMMLab" style="text-decoration:none;">
......@@ -40,33 +40,63 @@
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://www.youtube.com/openmmlab" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/218346691-ceb2116a-465a-40af-8424-9f30d2348ca9.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://space.bilibili.com/1293512903" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/219026751-d7d14cce-a7c9-4e82-9942-8375fca65b99.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://www.zhihu.com/people/openmmlab" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/219026120-ba71e48b-6e94-4bd4-b4e9-b7d175b5e362.png" width="3%" alt="" /></a>
</div>
## Introduction
## Installation
LMDeploy is a toolkit for compressing, deploying, and serving LLMs, jointly developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams. Its core features are:
- An efficient inference engine, **TurboMind**, built on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), supporting inference of LLaMA and its variants on NVIDIA devices
- Interactive-mode inference: by caching the attention k/v of multi-turn dialogues, it remembers the conversation history and avoids re-decoding it
<div align="center">
<img src="https://github.com/NVIDIA/FasterTransformer/blob/main/docs/images/gpt/gpt_interactive_generation.2.png?raw=true" width="600"/>
</div>
- Persistent-batch inference is supported
TODO: gif to show what persistent batch is
## Quick Start
### Installation
```shell
conda create -n open-mmlab python=3.8
conda activate open-mmlab
git clone https://github.com/open-mmlab/llmdeploy.git
cd llmdeploy
git clone https://github.com/open-mmlab/lmdeploy.git
cd lmdeploy
pip install -e .
```
## Quick Start
### Build
Pull the docker image `openmmlab/lmdeploy:latest`, mount the lmdeploy source as a data volume, start the container, and run the following commands inside it:
```shell
mkdir build && cd build
../generate.sh
make -j$(nproc) && make install
```
### Serving [LLaMA](https://github.com/facebookresearch/llama)
Please fill out [this form](https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform) to obtain the LLaMA model weights.
Run any one of the following commands to deploy the LLaMA model on an NVIDIA GPU server:
Run the following commands to deploy the LLaMA model on an NVIDIA GPU server:
<details open>
<summary><b>7B</b></summary>
```shell
python3 llmdeploy/serve/fastertransformer/deploy.py llama-7B /path/to/llama-7b llama \
python3 lmdeploy/serve/fastertransformer/deploy.py llama-7B /path/to/llama-7b llama \
--tokenizer_path /path/to/tokenizer/model
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
......@@ -77,35 +107,13 @@ bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fast
<summary><b>13B</b></summary>
```shell
python3 llmdeploy/serve/fastertransformer/deploy.py llama-13B /path/to/llama-13b llama \
python3 lmdeploy/serve/fastertransformer/deploy.py llama-13B /path/to/llama-13b llama \
--tokenizer_path /path/to/tokenizer/model --tp 2
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
</details>
<details open>
<summary><b>33B</b></summary>
```shell
python3 llmdeploy/serve/fastertransformer/deploy.py llama-33B /path/to/llama-33b llama \
--tokenizer_path /path/to/tokenizer/model --tp 4
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
</details>
<details open>
<summary><b>65B</b></summary>
```shell
python3 llmdeploy/serve/fastertransformer/deploy.py llama-65B /path/to/llama-65b llama \
--tokenizer_path /path/to/tokenizer/model --tp 8
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
</details>
### Serving [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/)
<details open>
......@@ -118,7 +126,7 @@ python3 -m fastchat.model.apply_delta \
--target-model-path /path/to/vicuna-7b \
--delta-path lmsys/vicuna-7b-delta-v1.1
python3 llmdeploy/serve/fastertransformer/deploy.py vicuna-7B /path/to/vicuna-7b hf
python3 lmdeploy/serve/fastertransformer/deploy.py vicuna-7B /path/to/vicuna-7b hf
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
......@@ -134,7 +142,7 @@ python3 -m fastchat.model.apply_delta \
--target-model-path /path/to/vicuna-13b \
--delta-path lmsys/vicuna-13b-delta-v1.1
python3 llmdeploy/serve/fastertransformer/deploy.py vicuna-13B /path/to/vicuna-13b hf
python3 lmdeploy/serve/fastertransformer/deploy.py vicuna-13B /path/to/vicuna-13b hf
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
......@@ -143,24 +151,27 @@ bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fast
## Inference with Command Line Interface
```shell
python3 llmdeploy/serve/client.py {server_ip_addresss}:33337 1
python3 lmdeploy/serve/client.py {server_ip_address}:33337
```
## Inference with Web UI
```shell
python3 llmdeploy/app.py {server_ip_addresss}:33337 model_name
python3 lmdeploy/app.py {server_ip_address}:33337 {model_name}
```
## Quantization
In fp16 mode, int8 quantization of the kv_cache can be enabled so that a single GPU can serve more users.
First, run the quantization script; the quantization parameters are written to the weight directory produced by `deploy.py`.
Then adjust `config.ini`:
* Set `use_context_fmha` to 0, which disables context FMHA
* Set `quant_policy` to 4; it defaults to 0, which means quantization is disabled
- Set `use_context_fmha` to 0, which disables context FMHA
- Set `quant_policy` to 4; it defaults to 0, which means quantization is disabled
## Contributing
We appreciate all contributors' efforts to improve LLMDeploy. Please refer to the [contributing guide](.github/CONTRIBUTING.md) for guidance on participating in the project.
We appreciate all contributors' efforts to improve LMDeploy. Please refer to the [contributing guide](.github/CONTRIBUTING.md) for guidance on participating in the project.
## Acknowledgement
......
......@@ -4,7 +4,7 @@ import time
import fire
import numpy as np
from llmdeploy.serve.fastertransformer.chatbot import Chatbot
from lmdeploy.serve.fastertransformer.chatbot import Chatbot
def infer(chatbot, session_id: int, prompt: str, output_seqlen: int,
......
......@@ -9,7 +9,7 @@ import fire
import numpy as np
from sentencepiece import SentencePieceProcessor
from llmdeploy.serve.fastertransformer.chatbot import Chatbot
from lmdeploy.serve.fastertransformer.chatbot import Chatbot
class Tokenizer:
......
# Copyright (c) OpenMMLab. All rights reserved.
from functools import partial
import os
import threading
from functools import partial
from typing import Sequence
import fire
import gradio as gr
import os
from llmdeploy.serve.fastertransformer.chatbot import Chatbot
from lmdeploy.serve.fastertransformer.chatbot import Chatbot
CSS = """
#container {
......@@ -29,7 +29,7 @@ CSS = """
THEME = gr.themes.Soft(
primary_hue=gr.themes.colors.blue,
secondary_hue=gr.themes.colors.sky,
font=[gr.themes.GoogleFont("Inconsolata"), "Arial", "sans-serif"])
font=[gr.themes.GoogleFont('Inconsolata'), 'Arial', 'sans-serif'])
def chat_stream(instruction: str,
......@@ -64,8 +64,10 @@ def reset_all_func(instruction_txtbox: gr.Textbox, state_chatbot: gr.State,
state_chatbot = []
log_level = os.environ.get('SERVICE_LOG_LEVEL', 'INFO')
llama_chatbot = Chatbot(
triton_server_addr, model_name, log_level=log_level, display=True)
llama_chatbot = Chatbot(triton_server_addr,
model_name,
log_level=log_level,
display=True)
return (
llama_chatbot,
......@@ -95,21 +97,19 @@ def run(triton_server_addr: str,
server_port: int = 6006):
with gr.Blocks(css=CSS, theme=THEME) as demo:
chat_interface = partial(chat_stream, model_name=model_name)
reset_all = partial(
reset_all_func,
model_name=model_name,
triton_server_addr=triton_server_addr)
reset_all = partial(reset_all_func,
model_name=model_name,
triton_server_addr=triton_server_addr)
log_level = os.environ.get('SERVICE_LOG_LEVEL', 'INFO')
llama_chatbot = gr.State(
Chatbot(
triton_server_addr,
model_name,
log_level=log_level,
display=True))
Chatbot(triton_server_addr,
model_name,
log_level=log_level,
display=True))
state_chatbot = gr.State([])
with gr.Column(elem_id='container'):
gr.Markdown('## LLMDeploy Playground')
gr.Markdown('## LMDeploy Playground')
chatbot = gr.Chatbot(elem_id='chatbot', label=model_name)
instruction_txtbox = gr.Textbox(
......@@ -132,23 +132,22 @@ def run(triton_server_addr: str,
[instruction_txtbox],
)
cancel_btn.click(
cancel_func, [instruction_txtbox, state_chatbot, llama_chatbot],
[llama_chatbot, chatbot],
cancels=[send_event])
cancel_btn.click(cancel_func,
[instruction_txtbox, state_chatbot, llama_chatbot],
[llama_chatbot, chatbot],
cancels=[send_event])
reset_btn.click(
reset_all, [instruction_txtbox, state_chatbot, llama_chatbot],
[llama_chatbot, state_chatbot, chatbot, instruction_txtbox],
cancels=[send_event])
demo.queue(
concurrency_count=4, max_size=100, api_open=True).launch(
max_threads=10,
share=True,
server_port=server_port,
server_name=server_name,
)
demo.queue(concurrency_count=4, max_size=100, api_open=True).launch(
max_threads=10,
share=True,
server_port=server_port,
server_name=server_name,
)
if __name__ == '__main__':
......
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine import Registry
MODELS = Registry('model', locations=['llmdeploy.model'])
MODELS = Registry('model', locations=['lmdeploy.model'])
@MODELS.register_module(name='vicuna')
......
......@@ -3,7 +3,7 @@ import os
import fire
from llmdeploy.serve.fastertransformer.chatbot import Chatbot
from lmdeploy.serve.fastertransformer.chatbot import Chatbot
def input_prompt():
......
# Copyright (c) OpenMMLab. All rights reserved.
from llmdeploy.serve.fastertransformer.chatbot import \
Chatbot # noqa: F401,F403
from lmdeploy.serve.fastertransformer.chatbot import Chatbot # noqa: F401,F403
......@@ -15,10 +15,10 @@ import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.grpc.service_pb2 import ModelInferResponse
from llmdeploy.model import MODELS
from llmdeploy.serve.fastertransformer.utils import (Postprocessor,
Preprocessor,
prepare_tensor)
from lmdeploy.model import MODELS
from lmdeploy.serve.fastertransformer.utils import (Postprocessor,
Preprocessor,
prepare_tensor)
@dataclass
......@@ -107,14 +107,13 @@ class Chatbot:
stop_words = None
bad_words = np.array([[[self.eos_id], [1]]], dtype=np.int32)
self.cfg = mmengine.Config(
dict(
session_len=session_len,
top_p=top_p,
top_k=top_k,
temperature=temperature,
repetition_penalty=repetition_penalty,
stop_words=stop_words,
bad_words=bad_words))
dict(session_len=session_len,
top_p=top_p,
top_k=top_k,
temperature=temperature,
repetition_penalty=repetition_penalty,
stop_words=stop_words,
bad_words=bad_words))
self.log_level = log_level
self.display = display
self.profile_generation = profile_generation
......@@ -203,12 +202,11 @@ class Chatbot:
return StatusCode.TRITON_SESSION_CLOSED
self._session.status = 0
for status, _, _ in self._stream_infer(
self._session,
prompt='',
request_output_len=0,
sequence_start=False,
sequence_end=True):
for status, _, _ in self._stream_infer(self._session,
prompt='',
request_output_len=0,
sequence_start=False,
sequence_end=True):
if status != StatusCode.TRITON_STREAM_END:
return status
......@@ -244,13 +242,12 @@ class Chatbot:
return StatusCode.TRITON_SESSION_CLOSED
prev_session = self._session
for status, res, _ in self._stream_infer(
self._session,
prompt='',
request_output_len=0,
sequence_start=False,
sequence_end=False,
cancel=True):
for status, res, _ in self._stream_infer(self._session,
prompt='',
request_output_len=0,
sequence_start=False,
sequence_end=False,
cancel=True):
if status.value < 0:
break
if status == StatusCode.TRITON_STREAM_END:
......@@ -346,11 +343,11 @@ class Chatbot:
session.response = ''
que = queue.Queue()
producer = threading.Thread(
target=self._stream_producer,
args=(self.tritonserver_addr, session, que, self.cfg, input_ids,
input_lengths, request_output_len, sequence_start,
sequence_end, preseq_length, cancel))
producer = threading.Thread(target=self._stream_producer,
args=(self.tritonserver_addr, session, que,
self.cfg, input_ids, input_lengths,
request_output_len, sequence_start,
sequence_end, preseq_length, cancel))
producer.start()
for state, res, tokens in self.stream_consumer(
self.postprocess, que, session, preseq_length, cancel, logger,
......@@ -421,13 +418,12 @@ class Chatbot:
random_seed * np.ones((1, 1), dtype=np.uint64))
]
client.start_stream(callback)
client.async_stream_infer(
'fastertransformer',
inputs,
sequence_id=session.session_id,
request_id=session.request_id,
sequence_start=sequence_start,
sequence_end=sequence_end)
client.async_stream_infer('fastertransformer',
inputs,
sequence_id=session.session_id,
request_id=session.request_id,
sequence_start=sequence_start,
sequence_end=sequence_end)
que.put(None)
@staticmethod
......
......@@ -127,29 +127,28 @@ def export(model_name: str,
vocab_size, bos_id, eos_id = tokenizer_info(tokenizer_path)
assert _vocab_size == vocab_size, \
f'different vocab size {_vocab_size} vs {vocab_size}'
cfg = dict(
llama=dict(
model_name=model_name,
head_num=head_num,
size_per_head=size_per_head,
vocab_size=vocab_size,
num_layer=num_layer,
rotary_embedding=size_per_head,
inter_size=inter_size,
norm_eps=norm_eps,
attn_bias=attn_bias,
start_id=bos_id,
end_id=eos_id,
weight_type='fp16',
# parameters for fastertransformer
max_batch_size=32,
max_context_token_num=4,
session_len=2048,
step_length=1,
cache_max_entry_count=48,
cache_chunk_size=8,
use_context_fmha=1,
quant_policy=0))
cfg = dict(llama=dict(
model_name=model_name,
head_num=head_num,
size_per_head=size_per_head,
vocab_size=vocab_size,
num_layer=num_layer,
rotary_embedding=size_per_head,
inter_size=inter_size,
norm_eps=norm_eps,
attn_bias=attn_bias,
start_id=bos_id,
end_id=eos_id,
weight_type='fp16',
# parameters for fastertransformer
max_batch_size=32,
max_context_token_num=4,
session_len=2048,
step_length=1,
cache_max_entry_count=48,
cache_chunk_size=8,
use_context_fmha=1,
quant_policy=0))
config = configparser.ConfigParser()
for section, key_values in cfg.items():
......@@ -191,8 +190,9 @@ def deploy_llama(model_name: str, model_path: str, tokenizer_path: str,
def get_param(_name, _size):
print(_name, _size)
if _name not in model_params:
model_params[_name] = torch.zeros(
_size, dtype=torch.float16, device='cpu')
model_params[_name] = torch.zeros(_size,
dtype=torch.float16,
device='cpu')
return model_params[_name]
for i, ckpt_path in enumerate(checkpoints):
......@@ -387,15 +387,12 @@ def deploy_hf(model_name: str, model_path: str, tokenizer_path: str,
def pack_model_repository(workspace_path: str):
model_repo_dir = osp.join(workspace_path, 'model_repository')
os.makedirs(model_repo_dir, exist_ok=True)
os.symlink(
src=osp.join('../triton_models/interactive'),
dst=osp.join(model_repo_dir, 'fastertransformer'))
os.symlink(
src=osp.join('../triton_models/preprocessing'),
dst=osp.join(model_repo_dir, 'preprocessing'))
os.symlink(
src=osp.join('../triton_models/postprocessing'),
dst=osp.join(model_repo_dir, 'postprocessing'))
os.symlink(src=osp.join('../triton_models/interactive'),
dst=osp.join(model_repo_dir, 'fastertransformer'))
os.symlink(src=osp.join('../triton_models/preprocessing'),
dst=osp.join(model_repo_dir, 'preprocessing'))
os.symlink(src=osp.join('../triton_models/postprocessing'),
dst=osp.join(model_repo_dir, 'postprocessing'))
def main(model_name: str,
......
......@@ -41,8 +41,8 @@ if [ -z "$1" ]; then
--cap-add=SYS_PTRACE \
--cap-add=SYS_ADMIN \
--security-opt seccomp=unconfined \
--name llmdeploy \
-it --env NCCL_LAUNCH_MODE=GROUP lvhan028/fastertransformer:v0.0.1 \
--name lmdeploy \
-it --env NCCL_LAUNCH_MODE=GROUP openmmlab/lmdeploy:latest \
tritonserver \
--model-repository=/workspace/models/model_repository \
--allow-http=0 \
......@@ -72,8 +72,8 @@ for ((i = 1; i <= $#; i++)); do
--cap-add=SYS_PTRACE \
--cap-add=SYS_ADMIN \
--security-opt seccomp=unconfined \
--name llmdeploy \
-it --env NCCL_LAUNCH_MODE=GROUP lvhan028/fastertransformer:v0.0.1 \
--name lmdeploy \
-it --env NCCL_LAUNCH_MODE=GROUP openmmlab/lmdeploy:latest \
tritonserver \
--model-repository=/workspace/models/model_repository \
--allow-http=0 \
......
......@@ -61,8 +61,8 @@ class Tokenizer:
return self.model.Decode(t)
else:
skip_special_tokens = False
return self.model.decode(
t, skip_special_tokens=skip_special_tokens)
return self.model.decode(t,
skip_special_tokens=skip_special_tokens)
class TritonPythonModel:
......
......@@ -63,8 +63,8 @@ class Tokenizer:
return self.model.Decode(t)
else:
skip_special_tokens = False
return self.model.decode(
t, skip_special_tokens=skip_special_tokens)
return self.model.decode(t,
skip_special_tokens=skip_special_tokens)
class TritonPythonModel:
......@@ -190,6 +190,7 @@ class TritonPythonModel:
for s in query
]
start_lengths = torch.IntTensor([[len(ids)] for ids in start_ids])
start_ids = pad_sequence(
start_ids, batch_first=True, padding_value=self.end_id)
start_ids = pad_sequence(start_ids,
batch_first=True,
padding_value=self.end_id)
return start_ids, start_lengths