Unverified commit 46f4738c authored by lvhan028, committed by GitHub

rename llmdeploy to lmdeploy (#30)

* change llmdeploy to lmdeploy

* update logo

* update readme
parent 081a6e89
......@@ -50,4 +50,4 @@ repos:
rev: v0.2.0
hooks:
- id: check-copyright
args: ["llmdeploy"]
args: ["lmdeploy"]
<div align="center">
<img src="resources/llmdeploy-logo.png" width="450"/>
<img src="resources/lmdeploy-logo.png" width="450"/>
<div>&nbsp;</div>
<div align="center">
<b><font size="5">OpenMMLab website</font></b>
......@@ -18,11 +18,11 @@
</div>
<div>&nbsp;</div>
[![docs](https://img.shields.io/badge/docs-latest-blue)](https://llmdeploy.readthedocs.io/en/latest/)
[![codecov](https://codecov.io/gh/open-mmlab/llmdeploy/branch/main/graph/badge.svg)](https://codecov.io/gh/open-mmlab/llmdeploy)
[![license](https://img.shields.io/github/license/open-mmlab/llmdeploy.svg)](https://github.com/open-mmlab/mmdeploy/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/open-mmlab/llmdeploy)](https://github.com/open-mmlab/llmdeploy/issues)
[![open issues](https://img.shields.io/github/issues-raw/open-mmlab/llmdeploy)](https://github.com/open-mmlab/llmdeploy/issues)
[![docs](https://img.shields.io/badge/docs-latest-blue)](https://lmdeploy.readthedocs.io/en/latest/)
[![codecov](https://codecov.io/gh/open-mmlab/lmdeploy/branch/main/graph/badge.svg)](https://codecov.io/gh/open-mmlab/lmdeploy)
[![license](https://img.shields.io/github/license/open-mmlab/lmdeploy.svg)](https://github.com/open-mmlab/mmdeploy/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/open-mmlab/lmdeploy)](https://github.com/open-mmlab/lmdeploy/issues)
[![open issues](https://img.shields.io/github/issues-raw/open-mmlab/lmdeploy)](https://github.com/open-mmlab/lmdeploy/issues)
English | [简体中文](README_zh-CN.md)
......@@ -30,9 +30,9 @@ English | [简体中文](README_zh-CN.md)
<div align="center">
<a href="https://openmmlab.medium.com/" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/218352562-cdded397-b0f3-4ca1-b8dd-a60df8dca75b.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/219255827-67c1a27f-f8c5-46a9-811d-5e57448c61d1.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://discord.gg/raweFPmdzG" style="text-decoration:none;">
<a href="https://discord.com/channels/1037617289144569886/1046608014234370059" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/218347213-c080267f-cbb6-443e-8532-8e1ed9a58ea9.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://twitter.com/OpenMMLab" style="text-decoration:none;">
......@@ -40,27 +40,47 @@ English | [简体中文](README_zh-CN.md)
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://www.youtube.com/openmmlab" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/218346691-ceb2116a-465a-40af-8424-9f30d2348ca9.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://space.bilibili.com/1293512903" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/219026751-d7d14cce-a7c9-4e82-9942-8375fca65b99.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://www.zhihu.com/people/openmmlab" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/219026120-ba71e48b-6e94-4bd4-b4e9-b7d175b5e362.png" width="3%" alt="" /></a>
</div>
## Introduction
## Installation
LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams. It has the following core features:
- A high-throughput inference engine, **TurboMind**, built on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer) for LLaMA-family models
- Interactive generation: by caching the attention k/v of multi-turn dialogues, LMDeploy remembers the conversation history and avoids re-decoding it (see the client sketch after this list)
<div align="center">
<img src="https://github.com/NVIDIA/FasterTransformer/blob/main/docs/images/gpt/gpt_interactive_generation.2.png?raw=true" width="600"/>
</div>
- Persistent-batch inference
TODO: gif to show what persistent batch is
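In practice, interactive generation is exercised through the gRPC `Chatbot` client that appears throughout this diff. Below is a minimal sketch; only the constructor call is visible in this change, so the `chat()` streaming method and its arguments are assumptions for illustration — check `lmdeploy/serve/fastertransformer/chatbot.py` for the real API.

```python
# Hedged sketch of multi-turn, interactive inference against a running
# lmdeploy Triton server. The Chatbot constructor matches the call shown
# elsewhere in this diff; chat() is an ASSUMED public method for illustration.
from lmdeploy.serve.fastertransformer.chatbot import Chatbot

chatbot = Chatbot('{server_ip_address}:33337',  # tritonserver address
                  'llama-7B',                   # model name used at deploy time
                  log_level='INFO',
                  display=True)

# Because the server caches the attention k/v per session, the second turn
# does not re-decode the first exchange.
session_id = 1
for prompt in ['Hello, who are you?', 'Summarize what you just said.']:
    for output in chatbot.chat(session_id, prompt):  # hypothetical method
        print(output, end='', flush=True)
```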
## Quick Start
### Installation
Below are quick steps for installation:
```shell
conda create -n open-mmlab python=3.8
conda activate open-mmlab
git clone https://github.com/open-mmlab/llmdeploy.git
cd llmdeploy
git clone https://github.com/open-mmlab/lmdeploy.git
cd lmdeploy
pip install -e .
```
## Quick Start
### Build
Pull docker image `openmmlab/llmdeploy:base` and build llmdeploy libs in its launched container
Pull the docker image `openmmlab/lmdeploy:latest`, launch a container from it, and build the lmdeploy libraries inside the container:
```shell
mkdir build && cd build
......@@ -78,7 +98,7 @@ Run one of the following commands to serve a LLaMA model on NVIDIA GPU server:
<summary><b>7B</b></summary>
```shell
python3 llmdeploy/serve/fastertransformer/deploy.py llama-7B /path/to/llama-7b llama \
python3 lmdeploy/serve/fastertransformer/deploy.py llama-7B /path/to/llama-7b llama \
--tokenizer_path /path/to/tokenizer/model
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
......@@ -89,35 +109,13 @@ bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fast
<summary><b>13B</b></summary>
```shell
python3 llmdeploy/serve/fastertransformer/deploy.py llama-13B /path/to/llama-13b llama \
python3 lmdeploy/serve/fastertransformer/deploy.py llama-13B /path/to/llama-13b llama \
--tokenizer_path /path/to/tokenizer/model --tp 2
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
</details>
<details open>
<summary><b>33B</b></summary>
```shell
python3 llmdeploy/serve/fastertransformer/deploy.py llama-33B /path/to/llama-33b llama \
--tokenizer_path /path/to/tokenizer/model --tp 4
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
</details>
<details open>
<summary><b>65B</b></summary>
```shell
python3 llmdeploy/serve/fastertransformer/deploy.py llama-65B /path/to/llama-65b llama \
--tokenizer_path /path/to/tokenizer/model --tp 8
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
</details>
### Serving [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/)
<details open>
......@@ -130,7 +128,7 @@ python3 -m fastchat.model.apply_delta \
--target-model-path /path/to/vicuna-7b \
--delta-path lmsys/vicuna-7b-delta-v1.1
python3 llmdeploy/serve/fastertransformer/deploy.py vicuna-7B /path/to/vicuna-7b hf
python3 lmdeploy/serve/fastertransformer/deploy.py vicuna-7B /path/to/vicuna-7b hf
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
......@@ -146,7 +144,7 @@ python3 -m fastchat.model.apply_delta \
--target-model-path /path/to/vicuna-13b \
--delta-path lmsys/vicuna-13b-delta-v1.1
python3 llmdeploy/serve/fastertransformer/deploy.py vicuna-13B /path/to/vicuna-13b hf
python3 lmdeploy/serve/fastertransformer/deploy.py vicuna-13B /path/to/vicuna-13b hf
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
......@@ -155,28 +153,29 @@ bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fast
## Inference with Command Line Interface
```shell
python3 llmdeploy/serve/client.py {server_ip_addresss}:33337 1
python3 lmdeploy/serve/client.py {server_ip_address}:33337
```
## Inference with Web UI
```shell
python3 llmdeploy/app.py {server_ip_addresss}:33337 model_name
python3 lmdeploy/app.py {server_ip_address}:33337 {model_name}
```
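The Web UI can also be started from Python. This is a hedged sketch based on the `run()` definition that appears in the `app.py` diff further down; the keyword parameters are inferred from that snippet and may not match exactly.

```python
# Hedged sketch: start the Gradio playground programmatically instead of via
# `python3 lmdeploy/app.py ...`. Parameter names are inferred from the run()
# definition shown later in this diff and may differ in the actual file.
from lmdeploy.app import run

run('{server_ip_address}:33337',  # triton_server_addr
    model_name='vicuna-7B',       # assumed keyword, mirrors the CLI argument
    server_name='0.0.0.0',        # assumed keyword seen in the launch() call
    server_port=6006)             # default value visible in the diff
```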
## User Guide
## Quantization
In fp16 mode, int8 quantization of the kv_cache can be enabled so that a single GPU can serve more users.
First, run the quantization script; the quantization parameters are written to the weight directory produced by `deploy.py`.
Then adjust `config.ini` (a scripted example follows the list below):
* `use_context_fmha` changed to 0, means off
* `quant_policy` is set to 4. This parameter defaults to 0, which means it is not enabled
## Contributing
- Set `use_context_fmha` to 0, which disables context FMHA
- Set `quant_policy` to 4; it defaults to 0, which means quantization is disabled
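These two edits can be scripted with Python's standard `configparser`, which `deploy.py` itself uses to write this file. The `llama` section and both keys are taken from the export code later in this diff; the config path inside the workspace is an assumed example.

```python
# Sketch: flip the quantization-related settings in the generated config.ini.
# The 'llama' section and both keys come from deploy.py's export code in this
# diff; the workspace path below is an assumption -- adjust it to your setup.
import configparser

cfg_path = 'workspace/triton_models/weights/config.ini'  # assumed location
config = configparser.ConfigParser()
config.read(cfg_path)

config['llama']['use_context_fmha'] = '0'  # turn off context FMHA
config['llama']['quant_policy'] = '4'      # enable kv_cache int8 quantization

with open(cfg_path, 'w') as f:
    config.write(f)
```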
We appreciate all contributions to LLMDeploy. Please refer to [CONTRIBUTING.md](.github/CONTRIBUTING.md) for the contributing guideline.
## Contributing
We appreciate all contributions to LMDeploy. Please refer to [CONTRIBUTING.md](.github/CONTRIBUTING.md) for the contributing guideline.
## Acknowledgement
......
<div align="center">
<img src="resources/llmdeploy-logo.png" width="450"/>
<img src="resources/lmdeploy-logo.png" width="450"/>
<div>&nbsp;</div>
<div align="center">
<b><font size="5">OpenMMLab website</font></b>
......@@ -18,11 +18,11 @@
</div>
<div>&nbsp;</div>
[![docs](https://img.shields.io/badge/docs-latest-blue)](https://llmdeploy.readthedocs.io/en/latest/)
[![codecov](https://codecov.io/gh/open-mmlab/llmdeploy/branch/main/graph/badge.svg)](https://codecov.io/gh/open-mmlab/llmdeploy)
[![license](https://img.shields.io/github/license/open-mmlab/llmdeploy.svg)](https://github.com/open-mmlab/mmdeploy/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/open-mmlab/llmdeploy)](https://github.com/open-mmlab/llmdeploy/issues)
[![open issues](https://img.shields.io/github/issues-raw/open-mmlab/llmdeploy)](https://github.com/open-mmlab/llmdeploy/issues)
[![docs](https://img.shields.io/badge/docs-latest-blue)](https://lmdeploy.readthedocs.io/en/latest/)
[![codecov](https://codecov.io/gh/open-mmlab/lmdeploy/branch/main/graph/badge.svg)](https://codecov.io/gh/open-mmlab/lmdeploy)
[![license](https://img.shields.io/github/license/open-mmlab/lmdeploy.svg)](https://github.com/open-mmlab/mmdeploy/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/open-mmlab/lmdeploy)](https://github.com/open-mmlab/lmdeploy/issues)
[![open issues](https://img.shields.io/github/issues-raw/open-mmlab/lmdeploy)](https://github.com/open-mmlab/lmdeploy/issues)
[English](README.md) | 简体中文
......@@ -30,9 +30,9 @@
<div align="center">
<a href="https://openmmlab.medium.com/" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/218352562-cdded397-b0f3-4ca1-b8dd-a60df8dca75b.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/219255827-67c1a27f-f8c5-46a9-811d-5e57448c61d1.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://discord.gg/raweFPmdzG" style="text-decoration:none;">
<a href="https://discord.com/channels/1037617289144569886/1046608014234370059" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/218347213-c080267f-cbb6-443e-8532-8e1ed9a58ea9.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://twitter.com/OpenMMLab" style="text-decoration:none;">
......@@ -40,33 +40,63 @@
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://www.youtube.com/openmmlab" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/218346691-ceb2116a-465a-40af-8424-9f30d2348ca9.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://space.bilibili.com/1293512903" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/219026751-d7d14cce-a7c9-4e82-9942-8375fca65b99.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://www.zhihu.com/people/openmmlab" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/219026120-ba71e48b-6e94-4bd4-b4e9-b7d175b5e362.png" width="3%" alt="" /></a>
</div>
## Introduction
## Installation
LMDeploy is a toolkit for compressing, deploying, and serving LLMs, jointly developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams. Its core features are:
- An efficient inference engine, **TurboMind**, built on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), supporting inference of LLaMA and its variants on NVIDIA devices
- Interactive-mode inference: by caching the attention k/v of multi-turn dialogues, it remembers the conversation history and avoids re-decoding it
<div align="center">
<img src="https://github.com/NVIDIA/FasterTransformer/blob/main/docs/images/gpt/gpt_interactive_generation.2.png?raw=true" width="600"/>
</div>
- Persistent-batch inference is supported
TODO: gif to show what persistent batch is
## Quick Start
### Installation
```shell
conda create -n open-mmlab python=3.8
conda activate open-mmlab
git clone https://github.com/open-mmlab/llmdeploy.git
cd llmdeploy
git clone https://github.com/open-mmlab/lmdeploy.git
cd lmdeploy
pip install -e .
```
## Quick Start
### Build
Pull the docker image `openmmlab/lmdeploy:latest`, mount the lmdeploy source as a data volume, start the container, and run the following commands inside it:
```shell
mkdir build && cd build
../generate.sh
make -j$(nproc) && make install
```
### Serving [LLaMA](https://github.com/facebookresearch/llama)
Please fill out [this form](https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform) to obtain the LLaMA model weights.
Run any one of the following commands to deploy the LLaMA model on an NVIDIA GPU server:
Run the following commands to deploy the LLaMA model on an NVIDIA GPU server:
<details open>
<summary><b>7B</b></summary>
```shell
python3 llmdeploy/serve/fastertransformer/deploy.py llama-7B /path/to/llama-7b llama \
python3 lmdeploy/serve/fastertransformer/deploy.py llama-7B /path/to/llama-7b llama \
--tokenizer_path /path/to/tokenizer/model
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
......@@ -77,35 +107,13 @@ bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fast
<summary><b>13B</b></summary>
```shell
python3 llmdeploy/serve/fastertransformer/deploy.py llama-13B /path/to/llama-13b llama \
python3 lmdeploy/serve/fastertransformer/deploy.py llama-13B /path/to/llama-13b llama \
--tokenizer_path /path/to/tokenizer/model --tp 2
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
</details>
<details open>
<summary><b>33B</b></summary>
```shell
python3 llmdeploy/serve/fastertransformer/deploy.py llama-33B /path/to/llama-33b llama \
--tokenizer_path /path/to/tokenizer/model --tp 4
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
</details>
<details open>
<summary><b>65B</b></summary>
```shell
python3 llmdeploy/serve/fastertransformer/deploy.py llama-65B /path/to/llama-65b llama \
--tokenizer_path /path/to/tokenizer/model --tp 8
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
</details>
### Serving [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/)
<details open>
......@@ -118,7 +126,7 @@ python3 -m fastchat.model.apply_delta \
--target-model-path /path/to/vicuna-7b \
--delta-path lmsys/vicuna-7b-delta-v1.1
python3 llmdeploy/serve/fastertransformer/deploy.py vicuna-7B /path/to/vicuna-7b hf
python3 lmdeploy/serve/fastertransformer/deploy.py vicuna-7B /path/to/vicuna-7b hf
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
......@@ -134,7 +142,7 @@ python3 -m fastchat.model.apply_delta \
--target-model-path /path/to/vicuna-13b \
--delta-path lmsys/vicuna-13b-delta-v1.1
python3 llmdeploy/serve/fastertransformer/deploy.py vicuna-13B /path/to/vicuna-13b hf
python3 lmdeploy/serve/fastertransformer/deploy.py vicuna-13B /path/to/vicuna-13b hf
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
......@@ -143,24 +151,27 @@ bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fast
## Inference with Command Line Interface
```shell
python3 llmdeploy/serve/client.py {server_ip_addresss}:33337 1
python3 lmdeploy/serve/client.py {server_ip_address}:33337
```
## Inference with Web UI
```shell
python3 llmdeploy/app.py {server_ip_addresss}:33337 model_name
python3 lmdeploy/app.py {server_ip_address}:33337 {model_name}
```
## Quantization
In fp16 mode, int8 quantization of the kv_cache can be enabled so that a single GPU can serve more users.
First, run the quantization script; the quantization parameters are written to the weight directory produced by `deploy.py`.
Then adjust `config.ini`:
* Set `use_context_fmha` to 0, which disables context FMHA
* Set `quant_policy` to 4; it defaults to 0, which means quantization is disabled
- Set `use_context_fmha` to 0, which disables context FMHA
- Set `quant_policy` to 4; it defaults to 0, which means quantization is disabled
## Contributing
We appreciate all contributors' efforts to improve LLMDeploy. Please refer to the [contributing guide](.github/CONTRIBUTING.md) for guidance on participating in the project.
We appreciate all contributors' efforts to improve LMDeploy. Please refer to the [contributing guide](.github/CONTRIBUTING.md) for guidance on participating in the project.
## Acknowledgement
......
......@@ -4,7 +4,7 @@ import time
import fire
import numpy as np
from llmdeploy.serve.fastertransformer.chatbot import Chatbot
from lmdeploy.serve.fastertransformer.chatbot import Chatbot
def infer(chatbot, session_id: int, prompt: str, output_seqlen: int,
......
......@@ -9,7 +9,7 @@ import fire
import numpy as np
from sentencepiece import SentencePieceProcessor
from llmdeploy.serve.fastertransformer.chatbot import Chatbot
from lmdeploy.serve.fastertransformer.chatbot import Chatbot
class Tokenizer:
......
# Copyright (c) OpenMMLab. All rights reserved.
from functools import partial
import os
import threading
from functools import partial
from typing import Sequence
import fire
import gradio as gr
import os
from llmdeploy.serve.fastertransformer.chatbot import Chatbot
from lmdeploy.serve.fastertransformer.chatbot import Chatbot
CSS = """
#container {
......@@ -29,7 +29,7 @@ CSS = """
THEME = gr.themes.Soft(
primary_hue=gr.themes.colors.blue,
secondary_hue=gr.themes.colors.sky,
font=[gr.themes.GoogleFont("Inconsolata"), "Arial", "sans-serif"])
font=[gr.themes.GoogleFont('Inconsolata'), 'Arial', 'sans-serif'])
def chat_stream(instruction: str,
......@@ -64,8 +64,10 @@ def reset_all_func(instruction_txtbox: gr.Textbox, state_chatbot: gr.State,
state_chatbot = []
log_level = os.environ.get('SERVICE_LOG_LEVEL', 'INFO')
llama_chatbot = Chatbot(
triton_server_addr, model_name, log_level=log_level, display=True)
llama_chatbot = Chatbot(triton_server_addr,
model_name,
log_level=log_level,
display=True)
return (
llama_chatbot,
......@@ -95,21 +97,19 @@ def run(triton_server_addr: str,
server_port: int = 6006):
with gr.Blocks(css=CSS, theme=THEME) as demo:
chat_interface = partial(chat_stream, model_name=model_name)
reset_all = partial(
reset_all_func,
model_name=model_name,
triton_server_addr=triton_server_addr)
reset_all = partial(reset_all_func,
model_name=model_name,
triton_server_addr=triton_server_addr)
log_level = os.environ.get('SERVICE_LOG_LEVEL', 'INFO')
llama_chatbot = gr.State(
Chatbot(
triton_server_addr,
model_name,
log_level=log_level,
display=True))
Chatbot(triton_server_addr,
model_name,
log_level=log_level,
display=True))
state_chatbot = gr.State([])
with gr.Column(elem_id='container'):
gr.Markdown('## LLMDeploy Playground')
gr.Markdown('## LMDeploy Playground')
chatbot = gr.Chatbot(elem_id='chatbot', label=model_name)
instruction_txtbox = gr.Textbox(
......@@ -132,23 +132,22 @@ def run(triton_server_addr: str,
[instruction_txtbox],
)
cancel_btn.click(
cancel_func, [instruction_txtbox, state_chatbot, llama_chatbot],
[llama_chatbot, chatbot],
cancels=[send_event])
cancel_btn.click(cancel_func,
[instruction_txtbox, state_chatbot, llama_chatbot],
[llama_chatbot, chatbot],
cancels=[send_event])
reset_btn.click(
reset_all, [instruction_txtbox, state_chatbot, llama_chatbot],
[llama_chatbot, state_chatbot, chatbot, instruction_txtbox],
cancels=[send_event])
demo.queue(
concurrency_count=4, max_size=100, api_open=True).launch(
max_threads=10,
share=True,
server_port=server_port,
server_name=server_name,
)
demo.queue(concurrency_count=4, max_size=100, api_open=True).launch(
max_threads=10,
share=True,
server_port=server_port,
server_name=server_name,
)
if __name__ == '__main__':
......
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine import Registry
MODELS = Registry('model', locations=['llmdeploy.model'])
MODELS = Registry('model', locations=['lmdeploy.model'])
@MODELS.register_module(name='vicuna')
......
......@@ -3,7 +3,7 @@ import os
import fire
from llmdeploy.serve.fastertransformer.chatbot import Chatbot
from lmdeploy.serve.fastertransformer.chatbot import Chatbot
def input_prompt():
......
# Copyright (c) OpenMMLab. All rights reserved.
from llmdeploy.serve.fastertransformer.chatbot import \
Chatbot # noqa: F401,F403
from lmdeploy.serve.fastertransformer.chatbot import Chatbot # noqa: F401,F403
......@@ -15,10 +15,10 @@ import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.grpc.service_pb2 import ModelInferResponse
from llmdeploy.model import MODELS
from llmdeploy.serve.fastertransformer.utils import (Postprocessor,
Preprocessor,
prepare_tensor)
from lmdeploy.model import MODELS
from lmdeploy.serve.fastertransformer.utils import (Postprocessor,
Preprocessor,
prepare_tensor)
@dataclass
......@@ -107,14 +107,13 @@ class Chatbot:
stop_words = None
bad_words = np.array([[[self.eos_id], [1]]], dtype=np.int32)
self.cfg = mmengine.Config(
dict(
session_len=session_len,
top_p=top_p,
top_k=top_k,
temperature=temperature,
repetition_penalty=repetition_penalty,
stop_words=stop_words,
bad_words=bad_words))
dict(session_len=session_len,
top_p=top_p,
top_k=top_k,
temperature=temperature,
repetition_penalty=repetition_penalty,
stop_words=stop_words,
bad_words=bad_words))
self.log_level = log_level
self.display = display
self.profile_generation = profile_generation
......@@ -203,12 +202,11 @@ class Chatbot:
return StatusCode.TRITON_SESSION_CLOSED
self._session.status = 0
for status, _, _ in self._stream_infer(
self._session,
prompt='',
request_output_len=0,
sequence_start=False,
sequence_end=True):
for status, _, _ in self._stream_infer(self._session,
prompt='',
request_output_len=0,
sequence_start=False,
sequence_end=True):
if status != StatusCode.TRITON_STREAM_END:
return status
......@@ -244,13 +242,12 @@ class Chatbot:
return StatusCode.TRITON_SESSION_CLOSED
prev_session = self._session
for status, res, _ in self._stream_infer(
self._session,
prompt='',
request_output_len=0,
sequence_start=False,
sequence_end=False,
cancel=True):
for status, res, _ in self._stream_infer(self._session,
prompt='',
request_output_len=0,
sequence_start=False,
sequence_end=False,
cancel=True):
if status.value < 0:
break
if status == StatusCode.TRITON_STREAM_END:
......@@ -346,11 +343,11 @@ class Chatbot:
session.response = ''
que = queue.Queue()
producer = threading.Thread(
target=self._stream_producer,
args=(self.tritonserver_addr, session, que, self.cfg, input_ids,
input_lengths, request_output_len, sequence_start,
sequence_end, preseq_length, cancel))
producer = threading.Thread(target=self._stream_producer,
args=(self.tritonserver_addr, session, que,
self.cfg, input_ids, input_lengths,
request_output_len, sequence_start,
sequence_end, preseq_length, cancel))
producer.start()
for state, res, tokens in self.stream_consumer(
self.postprocess, que, session, preseq_length, cancel, logger,
......@@ -421,13 +418,12 @@ class Chatbot:
random_seed * np.ones((1, 1), dtype=np.uint64))
]
client.start_stream(callback)
client.async_stream_infer(
'fastertransformer',
inputs,
sequence_id=session.session_id,
request_id=session.request_id,
sequence_start=sequence_start,
sequence_end=sequence_end)
client.async_stream_infer('fastertransformer',
inputs,
sequence_id=session.session_id,
request_id=session.request_id,
sequence_start=sequence_start,
sequence_end=sequence_end)
que.put(None)
@staticmethod
......
......@@ -127,29 +127,28 @@ def export(model_name: str,
vocab_size, bos_id, eos_id = tokenizer_info(tokenizer_path)
assert _vocab_size == vocab_size, \
f'different vocab size {_vocab_size} vs {vocab_size}'
cfg = dict(
llama=dict(
model_name=model_name,
head_num=head_num,
size_per_head=size_per_head,
vocab_size=vocab_size,
num_layer=num_layer,
rotary_embedding=size_per_head,
inter_size=inter_size,
norm_eps=norm_eps,
attn_bias=attn_bias,
start_id=bos_id,
end_id=eos_id,
weight_type='fp16',
# parameters for fastertransformer
max_batch_size=32,
max_context_token_num=4,
session_len=2048,
step_length=1,
cache_max_entry_count=48,
cache_chunk_size=8,
use_context_fmha=1,
quant_policy=0))
cfg = dict(llama=dict(
model_name=model_name,
head_num=head_num,
size_per_head=size_per_head,
vocab_size=vocab_size,
num_layer=num_layer,
rotary_embedding=size_per_head,
inter_size=inter_size,
norm_eps=norm_eps,
attn_bias=attn_bias,
start_id=bos_id,
end_id=eos_id,
weight_type='fp16',
# parameters for fastertransformer
max_batch_size=32,
max_context_token_num=4,
session_len=2048,
step_length=1,
cache_max_entry_count=48,
cache_chunk_size=8,
use_context_fmha=1,
quant_policy=0))
config = configparser.ConfigParser()
for section, key_values in cfg.items():
......@@ -191,8 +190,9 @@ def deploy_llama(model_name: str, model_path: str, tokenizer_path: str,
def get_param(_name, _size):
print(_name, _size)
if _name not in model_params:
model_params[_name] = torch.zeros(
_size, dtype=torch.float16, device='cpu')
model_params[_name] = torch.zeros(_size,
dtype=torch.float16,
device='cpu')
return model_params[_name]
for i, ckpt_path in enumerate(checkpoints):
......@@ -387,15 +387,12 @@ def deploy_hf(model_name: str, model_path: str, tokenizer_path: str,
def pack_model_repository(workspace_path: str):
model_repo_dir = osp.join(workspace_path, 'model_repository')
os.makedirs(model_repo_dir, exist_ok=True)
os.symlink(
src=osp.join('../triton_models/interactive'),
dst=osp.join(model_repo_dir, 'fastertransformer'))
os.symlink(
src=osp.join('../triton_models/preprocessing'),
dst=osp.join(model_repo_dir, 'preprocessing'))
os.symlink(
src=osp.join('../triton_models/postprocessing'),
dst=osp.join(model_repo_dir, 'postprocessing'))
os.symlink(src=osp.join('../triton_models/interactive'),
dst=osp.join(model_repo_dir, 'fastertransformer'))
os.symlink(src=osp.join('../triton_models/preprocessing'),
dst=osp.join(model_repo_dir, 'preprocessing'))
os.symlink(src=osp.join('../triton_models/postprocessing'),
dst=osp.join(model_repo_dir, 'postprocessing'))
def main(model_name: str,
......
......@@ -41,8 +41,8 @@ if [ -z "$1" ]; then
--cap-add=SYS_PTRACE \
--cap-add=SYS_ADMIN \
--security-opt seccomp=unconfined \
--name llmdeploy \
-it --env NCCL_LAUNCH_MODE=GROUP lvhan028/fastertransformer:v0.0.1 \
--name lmdeploy \
-it --env NCCL_LAUNCH_MODE=GROUP openmmlab/lmdeploy:latest \
tritonserver \
--model-repository=/workspace/models/model_repository \
--allow-http=0 \
......@@ -72,8 +72,8 @@ for ((i = 1; i <= $#; i++)); do
--cap-add=SYS_PTRACE \
--cap-add=SYS_ADMIN \
--security-opt seccomp=unconfined \
--name llmdeploy \
-it --env NCCL_LAUNCH_MODE=GROUP lvhan028/fastertransformer:v0.0.1 \
--name lmdeploy \
-it --env NCCL_LAUNCH_MODE=GROUP openmmlab/lmdeploy:latest \
tritonserver \
--model-repository=/workspace/models/model_repository \
--allow-http=0 \
......
......@@ -61,8 +61,8 @@ class Tokenizer:
return self.model.Decode(t)
else:
skip_special_tokens = False
return self.model.decode(
t, skip_special_tokens=skip_special_tokens)
return self.model.decode(t,
skip_special_tokens=skip_special_tokens)
class TritonPythonModel:
......
......@@ -63,8 +63,8 @@ class Tokenizer:
return self.model.Decode(t)
else:
skip_special_tokens = False
return self.model.decode(
t, skip_special_tokens=skip_special_tokens)
return self.model.decode(t,
skip_special_tokens=skip_special_tokens)
class TritonPythonModel:
......@@ -190,6 +190,7 @@ class TritonPythonModel:
for s in query
]
start_lengths = torch.IntTensor([[len(ids)] for ids in start_ids])
start_ids = pad_sequence(
start_ids, batch_first=True, padding_value=self.end_id)
start_ids = pad_sequence(start_ids,
batch_first=True,
padding_value=self.end_id)
return start_ids, start_lengths