Unverified Commit 46f4738c authored by lvhan028, committed by GitHub

rename llmdeploy to lmdeploy (#30)

* change llmdeploy to lmdeploy

* update logo

* update readme
parent 081a6e89
@@ -50,4 +50,4 @@ repos:
    rev: v0.2.0
    hooks:
      - id: check-copyright
        args: ["lmdeploy"]
<div align="center">
<img src="resources/lmdeploy-logo.png" width="450"/>
<div>&nbsp;</div>
<div align="center">
<b><font size="5">OpenMMLab website</font></b>
@@ -18,11 +18,11 @@
</div>
<div>&nbsp;</div>

[![docs](https://img.shields.io/badge/docs-latest-blue)](https://lmdeploy.readthedocs.io/en/latest/)
[![codecov](https://codecov.io/gh/open-mmlab/lmdeploy/branch/main/graph/badge.svg)](https://codecov.io/gh/open-mmlab/lmdeploy)
[![license](https://img.shields.io/github/license/open-mmlab/lmdeploy.svg)](https://github.com/open-mmlab/mmdeploy/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/open-mmlab/lmdeploy)](https://github.com/open-mmlab/lmdeploy/issues)
[![open issues](https://img.shields.io/github/issues-raw/open-mmlab/lmdeploy)](https://github.com/open-mmlab/lmdeploy/issues)

English | [简体中文](README_zh-CN.md)
@@ -30,9 +30,9 @@ English | [简体中文](README_zh-CN.md)
<div align="center">
<a href="https://openmmlab.medium.com/" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/219255827-67c1a27f-f8c5-46a9-811d-5e57448c61d1.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://discord.com/channels/1037617289144569886/1046608014234370059" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/218347213-c080267f-cbb6-443e-8532-8e1ed9a58ea9.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://twitter.com/OpenMMLab" style="text-decoration:none;">
@@ -40,27 +40,47 @@ English | [简体中文](README_zh-CN.md)
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://www.youtube.com/openmmlab" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/218346691-ceb2116a-465a-40af-8424-9f30d2348ca9.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://space.bilibili.com/1293512903" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/219026751-d7d14cce-a7c9-4e82-9942-8375fca65b99.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://www.zhihu.com/people/openmmlab" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/219026120-ba71e48b-6e94-4bd4-b4e9-b7d175b5e362.png" width="3%" alt="" /></a>
</div>

## Introduction
LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams. It has the following core features:
- A high-throughput inference engine named **TurboMind**, based on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), for LLaMA-family models
- Interactive generation: by caching the attention k/v of multi-turn dialogues, LMDeploy remembers the conversation history and avoids re-decoding past turns (see the sketch after this list)
<div align="center">
<img src="https://github.com/NVIDIA/FasterTransformer/blob/main/docs/images/gpt/gpt_interactive_generation.2.png?raw=true" width="600"/>
</div>
- Support for persistent-batch inference
TODO: gif to show what persistent batch is
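The snippet below is a minimal, hypothetical sketch of what multi-turn interactive generation could look like from Python. The import path and the `Chatbot` constructor arguments follow the code in this repository, but the `chat()` call and its parameters are illustrative assumptions, not the actual API.

```python
# Hypothetical sketch: `chat()` is an illustrative stand-in for the real
# generation entry point; the constructor arguments mirror lmdeploy/app.py.
from lmdeploy.serve.fastertransformer.chatbot import Chatbot

chatbot = Chatbot('localhost:33337',  # {server_ip_address}:33337 from the serving step
                  'vicuna',           # a model name registered in lmdeploy.model
                  log_level='INFO',
                  display=True)

session_id = 1
# Turn 1: the prompt is decoded once and its attention k/v stays cached server side.
chatbot.chat(session_id, 'Name three classic papers on language models.')
# Turn 2: only the new prompt is decoded; the cached k/v from turn 1 is reused
# instead of re-decoding the whole history.
chatbot.chat(session_id, 'Summarize them in one sentence each.')
```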
## Quick Start
### Installation
Below are quick steps for installation:
```shell
conda create -n open-mmlab python=3.8
conda activate open-mmlab
git clone https://github.com/open-mmlab/lmdeploy.git
cd lmdeploy
pip install -e .
```
### Build

Pull the docker image `openmmlab/lmdeploy:latest` and build the lmdeploy libs inside the launched container:
```shell
mkdir build && cd build
../generate.sh
make -j$(nproc) && make install
```
@@ -78,7 +98,7 @@ Run one of the following commands to serve a LLaMA model on NVIDIA GPU server:
<summary><b>7B</b></summary>

```shell
python3 lmdeploy/serve/fastertransformer/deploy.py llama-7B /path/to/llama-7b llama \
    --tokenizer_path /path/to/tokenizer/model
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
@@ -89,35 +109,13 @@ bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
<summary><b>13B</b></summary>

```shell
python3 lmdeploy/serve/fastertransformer/deploy.py llama-13B /path/to/llama-13b llama \
    --tokenizer_path /path/to/tokenizer/model --tp 2
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```

</details>
<details open>
<summary><b>33B</b></summary>
```shell
python3 llmdeploy/serve/fastertransformer/deploy.py llama-33B /path/to/llama-33b llama \
--tokenizer_path /path/to/tokenizer/model --tp 4
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
</details>
<details open>
<summary><b>65B</b></summary>
```shell
python3 llmdeploy/serve/fastertransformer/deploy.py llama-65B /path/to/llama-65b llama \
--tokenizer_path /path/to/tokenizer/model --tp 8
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
</details>
### Serving [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/)

<details open>
@@ -130,7 +128,7 @@ python3 -m fastchat.model.apply_delta \
    --target-model-path /path/to/vicuna-7b \
    --delta-path lmsys/vicuna-7b-delta-v1.1

python3 lmdeploy/serve/fastertransformer/deploy.py vicuna-7B /path/to/vicuna-7b hf
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
@@ -146,7 +144,7 @@ python3 -m fastchat.model.apply_delta \
    --target-model-path /path/to/vicuna-13b \
    --delta-path lmsys/vicuna-13b-delta-v1.1

python3 lmdeploy/serve/fastertransformer/deploy.py vicuna-13B /path/to/vicuna-13b hf
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
@@ -155,28 +153,29 @@ bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
## Inference with Command Line Interface

```shell
python3 lmdeploy/serve/client.py {server_ip_address}:33337
```
## Inference with Web UI

```shell
python3 lmdeploy/app.py {server_ip_address}:33337 {model_name}
```
## User Guide

## Quantization

In fp16 mode, kv_cache int8 quantization can be enabled so that a single GPU can serve more users.

First, run the quantization script; the quantization parameters are stored in the weight directory generated by `deploy.py`.

Then adjust `config.ini`:
- `use_context_fmha` changed to 0, meaning it is turned off
- `quant_policy` set to 4; this parameter defaults to 0, which means quantization is not enabled
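For illustration, the relevant lines of `config.ini` would then read as in the snippet below; only the two keys and values above come from this guide, and the file's exact location inside the directory produced by `deploy.py` is an assumption.

```ini
# Hypothetical excerpt of config.ini after enabling kv_cache int8 quantization;
# only use_context_fmha and quant_policy are prescribed by this guide.
use_context_fmha = 0
quant_policy = 4
```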
## Contributing

We appreciate all contributions to LMDeploy. Please refer to [CONTRIBUTING.md](.github/CONTRIBUTING.md) for the contributing guidelines.
## Acknowledgement
<div align="center">
<img src="resources/lmdeploy-logo.png" width="450"/>
<div>&nbsp;</div>
<div align="center">
<b><font size="5">OpenMMLab website</font></b>
@@ -18,11 +18,11 @@
</div>
<div>&nbsp;</div>

[![docs](https://img.shields.io/badge/docs-latest-blue)](https://lmdeploy.readthedocs.io/en/latest/)
[![codecov](https://codecov.io/gh/open-mmlab/lmdeploy/branch/main/graph/badge.svg)](https://codecov.io/gh/open-mmlab/lmdeploy)
[![license](https://img.shields.io/github/license/open-mmlab/lmdeploy.svg)](https://github.com/open-mmlab/mmdeploy/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/open-mmlab/lmdeploy)](https://github.com/open-mmlab/lmdeploy/issues)
[![open issues](https://img.shields.io/github/issues-raw/open-mmlab/lmdeploy)](https://github.com/open-mmlab/lmdeploy/issues)

[English](README.md) | 简体中文
@@ -30,9 +30,9 @@
<div align="center">
<a href="https://openmmlab.medium.com/" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/219255827-67c1a27f-f8c5-46a9-811d-5e57448c61d1.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://discord.com/channels/1037617289144569886/1046608014234370059" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/218347213-c080267f-cbb6-443e-8532-8e1ed9a58ea9.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://twitter.com/OpenMMLab" style="text-decoration:none;">
@@ -40,33 +40,63 @@
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://www.youtube.com/openmmlab" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/218346691-ceb2116a-465a-40af-8424-9f30d2348ca9.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://space.bilibili.com/1293512903" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/219026751-d7d14cce-a7c9-4e82-9942-8375fca65b99.png" width="3%" alt="" /></a>
<img src="https://user-images.githubusercontent.com/25839884/218346358-56cc8e2f-a2b8-487f-9088-32480cceabcf.png" width="3%" alt="" />
<a href="https://www.zhihu.com/people/openmmlab" style="text-decoration:none;">
<img src="https://user-images.githubusercontent.com/25839884/219026120-ba71e48b-6e94-4bd4-b4e9-b7d175b5e362.png" width="3%" alt="" /></a>
</div>

## Introduction

LMDeploy is a toolbox for LLM lightweighting, deployment, and serving, developed jointly by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams. It has the following core features:

- An efficient inference engine, **TurboMind**, built on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), supporting inference of LLaMA and its variants on NVIDIA devices
- An interactive inference mode: by caching the attention k/v of multi-turn dialogues, it remembers the conversation history and avoids re-decoding past turns
<div align="center">
<img src="https://github.com/NVIDIA/FasterTransformer/blob/main/docs/images/gpt/gpt_interactive_generation.2.png?raw=true" width="600"/>
</div>
- Support for persistent-batch inference
TODO: gif to show what persistent batch is
## Quick Start

### Installation
```shell
conda create -n open-mmlab python=3.8
conda activate open-mmlab
git clone https://github.com/open-mmlab/lmdeploy.git
cd lmdeploy
pip install -e .
```
### Build

Pull the docker image `openmmlab/lmdeploy:latest`, mount the lmdeploy source as a volume, launch a container, and run the following commands inside it (a sketch of the container launch follows the block):
```shell
mkdir build && cd build
../generate.sh
make -j$(nproc) && make install
```
### Serving [LLaMA](https://github.com/facebookresearch/llama)

Please fill out [this form](https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform) to obtain the LLaMA model weights.

Run one of the following commands to deploy the LLaMA model on an NVIDIA GPU server:
<details open>
<summary><b>7B</b></summary>

```shell
python3 lmdeploy/serve/fastertransformer/deploy.py llama-7B /path/to/llama-7b llama \
    --tokenizer_path /path/to/tokenizer/model
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
@@ -77,35 +107,13 @@ bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
<summary><b>13B</b></summary>

```shell
python3 lmdeploy/serve/fastertransformer/deploy.py llama-13B /path/to/llama-13b llama \
    --tokenizer_path /path/to/tokenizer/model --tp 2
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```

</details>
<details open>
<summary><b>33B</b></summary>
```shell
python3 llmdeploy/serve/fastertransformer/deploy.py llama-33B /path/to/llama-33b llama \
--tokenizer_path /path/to/tokenizer/model --tp 4
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
</details>
<details open>
<summary><b>65B</b></summary>
```shell
python3 llmdeploy/serve/fastertransformer/deploy.py llama-65B /path/to/llama-65b llama \
--tokenizer_path /path/to/tokenizer/model --tp 8
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
</details>
### Serving [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/)

<details open>
@@ -118,7 +126,7 @@ python3 -m fastchat.model.apply_delta \
    --target-model-path /path/to/vicuna-7b \
    --delta-path lmsys/vicuna-7b-delta-v1.1

python3 lmdeploy/serve/fastertransformer/deploy.py vicuna-7B /path/to/vicuna-7b hf
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
@@ -134,7 +142,7 @@ python3 -m fastchat.model.apply_delta \
    --target-model-path /path/to/vicuna-13b \
    --delta-path lmsys/vicuna-13b-delta-v1.1

python3 lmdeploy/serve/fastertransformer/deploy.py vicuna-13B /path/to/vicuna-13b hf
bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
```
@@ -143,24 +151,27 @@ bash workspace/service_docker_up.sh --lib-dir $(pwd)/build/install/backends/fastertransformer
## Inference with Command Line Interface

```shell
python3 lmdeploy/serve/client.py {server_ip_address}:33337
```
## Inference with Web UI

```shell
python3 lmdeploy/app.py {server_ip_address}:33337 {model_name}
```
## Quantization

In fp16 mode, kv_cache int8 quantization can be enabled so that a single GPU can serve more users.

First, run the quantization script; the quantization parameters are stored in the weight directory generated by `deploy.py`.

Then adjust `config.ini`:

- `use_context_fmha` changed to 0, meaning it is turned off
- `quant_policy` set to 4; this parameter defaults to 0, which means quantization is not enabled
## Contributing

We appreciate every contributor's effort to improve LMDeploy. Please refer to the [contributing guidelines](.github/CONTRIBUTING.md) to learn how to participate.

## Acknowledgement
@@ -4,7 +4,7 @@ import time
import fire
import numpy as np

from lmdeploy.serve.fastertransformer.chatbot import Chatbot


def infer(chatbot, session_id: int, prompt: str, output_seqlen: int,
@@ -9,7 +9,7 @@ import fire
import numpy as np
from sentencepiece import SentencePieceProcessor

from lmdeploy.serve.fastertransformer.chatbot import Chatbot


class Tokenizer:
# Copyright (c) OpenMMLab. All rights reserved.
import os
import threading
from functools import partial
from typing import Sequence

import fire
import gradio as gr

from lmdeploy.serve.fastertransformer.chatbot import Chatbot

CSS = """
#container {
@@ -29,7 +29,7 @@ CSS = """
THEME = gr.themes.Soft(
    primary_hue=gr.themes.colors.blue,
    secondary_hue=gr.themes.colors.sky,
    font=[gr.themes.GoogleFont('Inconsolata'), 'Arial', 'sans-serif'])


def chat_stream(instruction: str,
@@ -64,8 +64,10 @@ def reset_all_func(instruction_txtbox: gr.Textbox, state_chatbot: gr.State,
    state_chatbot = []
    log_level = os.environ.get('SERVICE_LOG_LEVEL', 'INFO')
    llama_chatbot = Chatbot(triton_server_addr,
                            model_name,
                            log_level=log_level,
                            display=True)

    return (
        llama_chatbot,
@@ -95,21 +97,19 @@ def run(triton_server_addr: str,
        server_port: int = 6006):
    with gr.Blocks(css=CSS, theme=THEME) as demo:
        chat_interface = partial(chat_stream, model_name=model_name)
        reset_all = partial(reset_all_func,
                            model_name=model_name,
                            triton_server_addr=triton_server_addr)
        log_level = os.environ.get('SERVICE_LOG_LEVEL', 'INFO')
        llama_chatbot = gr.State(
            Chatbot(triton_server_addr,
                    model_name,
                    log_level=log_level,
                    display=True))
        state_chatbot = gr.State([])

        with gr.Column(elem_id='container'):
            gr.Markdown('## LMDeploy Playground')
            chatbot = gr.Chatbot(elem_id='chatbot', label=model_name)
            instruction_txtbox = gr.Textbox(
@@ -132,8 +132,8 @@ def run(triton_server_addr: str,
            [instruction_txtbox],
        )

        cancel_btn.click(cancel_func,
                         [instruction_txtbox, state_chatbot, llama_chatbot],
                         [llama_chatbot, chatbot],
                         cancels=[send_event])
@@ -142,8 +142,7 @@ def run(triton_server_addr: str,
            [llama_chatbot, state_chatbot, chatbot, instruction_txtbox],
            cancels=[send_event])

    demo.queue(concurrency_count=4, max_size=100, api_open=True).launch(
        max_threads=10,
        share=True,
        server_port=server_port,
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine import Registry

MODELS = Registry('model', locations=['lmdeploy.model'])


@MODELS.register_module(name='vicuna')
@@ -3,7 +3,7 @@ import os
import fire

from lmdeploy.serve.fastertransformer.chatbot import Chatbot


def input_prompt():
# Copyright (c) OpenMMLab. All rights reserved.
from lmdeploy.serve.fastertransformer.chatbot import Chatbot  # noqa: F401,F403
@@ -15,8 +15,8 @@ import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.grpc.service_pb2 import ModelInferResponse

from lmdeploy.model import MODELS
from lmdeploy.serve.fastertransformer.utils import (Postprocessor,
                                                    Preprocessor,
                                                    prepare_tensor)
@@ -107,8 +107,7 @@ class Chatbot:
        stop_words = None
        bad_words = np.array([[[self.eos_id], [1]]], dtype=np.int32)
        self.cfg = mmengine.Config(
            dict(session_len=session_len,
                 top_p=top_p,
                 top_k=top_k,
                 temperature=temperature,
@@ -203,8 +202,7 @@ class Chatbot:
            return StatusCode.TRITON_SESSION_CLOSED

        self._session.status = 0
        for status, _, _ in self._stream_infer(self._session,
                                               prompt='',
                                               request_output_len=0,
                                               sequence_start=False,
@@ -244,8 +242,7 @@ class Chatbot:
            return StatusCode.TRITON_SESSION_CLOSED

        prev_session = self._session
        for status, res, _ in self._stream_infer(self._session,
                                                 prompt='',
                                                 request_output_len=0,
                                                 sequence_start=False,
@@ -346,10 +343,10 @@ class Chatbot:
        session.response = ''

        que = queue.Queue()
        producer = threading.Thread(target=self._stream_producer,
                                    args=(self.tritonserver_addr, session, que,
                                          self.cfg, input_ids, input_lengths,
                                          request_output_len, sequence_start,
                                          sequence_end, preseq_length, cancel))
        producer.start()
        for state, res, tokens in self.stream_consumer(
@@ -421,8 +418,7 @@ class Chatbot:
            random_seed * np.ones((1, 1), dtype=np.uint64))
        ]

        client.start_stream(callback)
        client.async_stream_infer('fastertransformer',
                                  inputs,
                                  sequence_id=session.session_id,
                                  request_id=session.request_id,
@@ -127,8 +127,7 @@ def export(model_name: str,
    vocab_size, bos_id, eos_id = tokenizer_info(tokenizer_path)
    assert _vocab_size == vocab_size, \
        f'different vocab size {_vocab_size} vs {vocab_size}'
    cfg = dict(llama=dict(
        model_name=model_name,
        head_num=head_num,
        size_per_head=size_per_head,
@@ -191,8 +190,9 @@ def deploy_llama(model_name: str, model_path: str, tokenizer_path: str,
    def get_param(_name, _size):
        print(_name, _size)
        if _name not in model_params:
            model_params[_name] = torch.zeros(_size,
                                              dtype=torch.float16,
                                              device='cpu')
        return model_params[_name]

    for i, ckpt_path in enumerate(checkpoints):
@@ -387,14 +387,11 @@ def deploy_hf(model_name: str, model_path: str, tokenizer_path: str,
def pack_model_repository(workspace_path: str):
    model_repo_dir = osp.join(workspace_path, 'model_repository')
    os.makedirs(model_repo_dir, exist_ok=True)
    os.symlink(src=osp.join('../triton_models/interactive'),
               dst=osp.join(model_repo_dir, 'fastertransformer'))
    os.symlink(src=osp.join('../triton_models/preprocessing'),
               dst=osp.join(model_repo_dir, 'preprocessing'))
    os.symlink(src=osp.join('../triton_models/postprocessing'),
               dst=osp.join(model_repo_dir, 'postprocessing'))
@@ -41,8 +41,8 @@ if [ -z "$1" ]; then
        --cap-add=SYS_PTRACE \
        --cap-add=SYS_ADMIN \
        --security-opt seccomp=unconfined \
        --name lmdeploy \
        -it --env NCCL_LAUNCH_MODE=GROUP openmmlab/lmdeploy:latest \
        tritonserver \
        --model-repository=/workspace/models/model_repository \
        --allow-http=0 \
@@ -72,8 +72,8 @@ for ((i = 1; i <= $#; i++)); do
        --cap-add=SYS_PTRACE \
        --cap-add=SYS_ADMIN \
        --security-opt seccomp=unconfined \
        --name lmdeploy \
        -it --env NCCL_LAUNCH_MODE=GROUP openmmlab/lmdeploy:latest \
        tritonserver \
        --model-repository=/workspace/models/model_repository \
        --allow-http=0 \
@@ -61,8 +61,8 @@ class Tokenizer:
            return self.model.Decode(t)
        else:
            skip_special_tokens = False
            return self.model.decode(t,
                                     skip_special_tokens=skip_special_tokens)


class TritonPythonModel:
@@ -63,8 +63,8 @@ class Tokenizer:
            return self.model.Decode(t)
        else:
            skip_special_tokens = False
            return self.model.decode(t,
                                     skip_special_tokens=skip_special_tokens)


class TritonPythonModel:
@@ -190,6 +190,7 @@ class TritonPythonModel:
            for s in query
        ]
        start_lengths = torch.IntTensor([[len(ids)] for ids in start_ids])
        start_ids = pad_sequence(start_ids,
                                 batch_first=True,
                                 padding_value=self.end_id)
        return start_ids, start_lengths