Unverified commit 7b470f07, authored by lvhan028, committed by GitHub

Refactor the chat template of supported models using factory pattern (#144)

* refactor model.py and support baichuan-7b

* remove model_name

* remove hard session_len

* export tokenizer.py to target dir

* remove model_name from client

* remove model_name

* update

* correct throughput equation

* fix session.response

* update serving.md

* update readme

* update according to review comments

* update

* update

* update

* update
parent 2067862d
@@ -54,7 +54,7 @@ The throughput of TurboMind exceeds 2000 tokens/s, which is about 5% - 15% highe
 Below are quick steps for installation:
 ```shell
-conda create -n lmdeploy python=3.10
+conda create -n lmdeploy python=3.10 -y
 conda activate lmdeploy
 git clone https://github.com/InternLM/lmdeploy.git
 cd lmdeploy
@@ -77,7 +77,7 @@ git clone https://huggingface.co/internlm/internlm-chat-7b /path/to/internlm-cha
 GIT_LFS_SKIP_SMUDGE=1
 # 2. Convert InternLM model to turbomind's format, which will be in "./workspace" by default
-python3 -m lmdeploy.serve.turbomind.deploy internlm-7b /path/to/internlm-chat-7b hf
+python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b
 ```
@@ -85,11 +85,11 @@ python3 -m lmdeploy.serve.turbomind.deploy internlm-7b /path/to/internlm-chat-7b
 ```shell
 docker run --gpus all --rm -v $(pwd)/workspace:/workspace -it openmmlab/lmdeploy:latest \
-    python3 -m lmdeploy.turbomind.chat internlm /workspace
+    python3 -m lmdeploy.turbomind.chat /workspace
 ```
 ```{note}
-When inferring with FP16 precision, the InternLM-7B model requires at least 22.7G of GPU memory overhead on TurboMind. It is recommended to use NVIDIA cards such as 3090, V100, A100, etc.
+When inferring with FP16 precision, the InternLM-7B model requires at least 15.7G of GPU memory overhead on TurboMind. It is recommended to use NVIDIA cards such as 3090, V100, A100, etc.
 ```
 #### Serving
@@ -103,7 +103,7 @@ bash workspace/service_docker_up.sh
 Then, you can communicate with the inference server by command line,
 ```shell
-python3 -m lmdeploy.serve.client {server_ip_addresss}:33337 internlm
+python3 -m lmdeploy.serve.client {server_ip_addresss}:33337
 ```
 or webui,
@@ -114,7 +114,7 @@ python3 -m lmdeploy.app {server_ip_addresss}:33337 internlm
 ![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)
-For the deployment of other supported models, such as LLaMA, vicuna, you can find the guide from [here](docs/en/serving.md)
+For the deployment of other supported models, such as LLaMA, LLaMA-2, vicuna and so on, you can find the guide from [here](docs/en/serving.md)
 ### Inference with PyTorch
...
@@ -53,7 +53,7 @@ TurboMind's throughput exceeds 2000 token/s, about 5% - 15% higher than DeepSpeed overall
 ### Installation
 ```shell
-conda create -n lmdeploy python=3.10
+conda create -n lmdeploy python=3.10 -y
 conda activate lmdeploy
 git clone https://github.com/InternLM/lmdeploy.git
 cd lmdeploy
@@ -76,7 +76,7 @@ git clone https://huggingface.co/internlm/internlm-chat-7b /path/to/internlm-cha
 GIT_LFS_SKIP_SMUDGE=1
 # 2. Convert the model to turbomind's format. The default output path is ./workspace
-python3 -m lmdeploy.serve.turbomind.deploy internlm-7b /path/to/internlm-chat-7b hf
+python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b
 ```
@@ -84,11 +84,11 @@ python3 -m lmdeploy.serve.turbomind.deploy internlm-7b /path/to/internlm-chat-7b
 ```shell
 docker run --gpus all --rm -v $(pwd)/workspace:/workspace -it openmmlab/lmdeploy:latest \
-    python3 -m lmdeploy.turbomind.chat internlm /workspace
+    python3 -m lmdeploy.turbomind.chat /workspace
 ```
 ```{note}
-When inferring the InternLM-7B model with FP16 precision, turbomind needs at least 22.7G of GPU memory. NVIDIA cards such as 3090, V100 and A100 are recommended.
+When inferring the InternLM-7B model with FP16 precision, turbomind needs at least 15.7G of GPU memory. NVIDIA cards such as 3090, V100 and A100 are recommended.
 ```
 #### Serving
@@ -102,18 +102,18 @@ bash workspace/service_docker_up.sh
 You can chat with the inference service from the command line:
 ```shell
-python3 -m lmdeploy.serve.client {server_ip_addresss}:33337 internlm
+python3 -m lmdeploy.serve.client {server_ip_addresss}:33337
 ```
 or through the WebUI:
 ```shell
-python3 -m lmdeploy.app {server_ip_addresss}:33337 internlm
+python3 -m lmdeploy.app {server_ip_addresss}:33337
 ```
 ![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)
-For how to deploy other models, such as LLaMA and vicuna, please refer to the guide [here](docs/zh_cn/serving.md)
+For how to deploy other models, such as LLaMA, LLaMA-2, vicuna and so on, please refer to the guide [here](docs/zh_cn/serving.md)
 ### Inference with PyTorch
...
@@ -2,7 +2,7 @@
 We provide several profiling tools to benchmark our models.
-## profiling with dataset
+## profile with dataset
 Download the dataset below or create your own dataset.
@@ -16,7 +16,6 @@ Profiling your model with `profile_throughput.py`
 python profile_throughput.py \
  ShareGPT_V3_unfiltered_cleaned_split.json \
  /path/to/your/model \
- ${ModelType} \
  --concurrency 64
 ```
@@ -27,7 +26,6 @@ python profile_throughput.py \
 ```bash
 python profile_generation.py \
  /path/to/your/model \
- ${ModelType} \
  --concurrency 8 --input_seqlen 0 --output_seqlen 2048
 ```
@@ -36,10 +34,11 @@ python profile_generation.py \
 Tools above profile models with Python API. `profile_serving.py` is used to do benchmark on serving.
 ```bash
+wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
 python profile_serving.py \
  ${TritonServerAddress} \
- ${ModelName} \
  /path/to/tokenizer \
- /path/to/dataset \
+ ShareGPT_V3_unfiltered_cleaned_split.json \
  --concurrency 64
 ```
@@ -7,7 +7,6 @@ from threading import Thread
 import fire
 import numpy as np
-from lmdeploy.model import MODELS
 from lmdeploy.turbomind import Tokenizer, TurboMind
@@ -74,16 +73,13 @@ def warmup(model, concurrency: int, output_seqlen: int, warmup_round: int = 4):
 def main(model_path: str,
-         model_name: str,
          concurrency: int = 1,
          input_seqlen: int = 0,
          output_seqlen: int = 512,
          test_round: int = 10):
     tokenizer_model_path = osp.join(model_path, 'triton_models', 'tokenizer')
     tokenizer = Tokenizer(tokenizer_model_path)
-    model = MODELS.get(model_name)()
-    stop_words = model.stop_words
-    tm_model = TurboMind(model_path=model_path, stop_words=stop_words)
+    tm_model = TurboMind(model_path=model_path)
     warmup(tm_model, concurrency, output_seqlen)
@@ -127,7 +123,8 @@ def main(model_path: str,
     token_latency_min = np.min(stats[:, 2], axis=0)
     token_latency_max = np.max(stats[:, 2], axis=0)
     token_latency_ave = np.mean(stats[:, 2], axis=0)
-    throughput = np.sum(stats[:, 1], axis=0) / np.sum(stats[:, 2], axis=0)
+    throughput = np.sum(stats[:, 1], axis=0) / np.sum(stats[:, 2],
+                                                      axis=0) * concurrency
     print(f'\n{"-" * 50}\nconcurrency: {concurrency}, input_tokens: '
           f'{input_seqlen}, output_tokens: {output_seqlen}\n'
           f'elapsed_time: {elapsed_time:.2f}s\n'
@@ -136,7 +133,7 @@ def main(model_path: str,
           f'{first_token_latency_ave:.2f}s\ntoken latency(min, max, ave): '
           f'{token_latency_min:.2f}s, {token_latency_max:.2f}s, '
           f'{token_latency_ave:.2f}s\n'
-          f'throughput per threads: {throughput} token/s\n{"-" * 50}')
+          f'throughput: {throughput} token/s\n{"-" * 50}')
 if __name__ == '__main__':
...
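The commit message mentions a corrected throughput equation, and the hunk above adds the `* concurrency` factor. The sketch below is only an illustration of why that factor belongs there; the `stats` array and all numbers are made up, merely shaped like the per-session statistics collected in `profile_generation.py` above.

```python
import numpy as np

# Toy stand-in for the collected stats: one row per concurrent session,
# columns roughly [first_token_latency, generated_tokens, total_latency].
stats = np.array([[0.9, 512., 10.0],
                  [1.1, 512., 10.2]])
concurrency = stats.shape[0]

# Total tokens divided by total latency is an *average per-session* rate,
# which is what the old formula reported.
per_session_rate = np.sum(stats[:, 1]) / np.sum(stats[:, 2])

# The sessions run in parallel, so the aggregate throughput is roughly the
# per-session rate scaled by the number of concurrent sessions.
throughput = per_session_rate * concurrency
print(f'~{throughput:.1f} token/s across {concurrency} sessions')
```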
@@ -53,29 +53,25 @@ def infer(chatbot, session_id: int, req_que: mp.Queue, res_que: mp.Queue):
 def warmup(tritonserver_addr: str,
-           model_name: str,
            concurrency: int,
-           session_len: int,
            output_seqlen: int,
            warmup_round: int = 4):
     print('start to warmup ...')
     def _infer(_chatbot, session_id):
         for _ in range(warmup_round):
-            for _, _, _ in chatbot.stream_infer(
+            for _, _, _ in _chatbot.stream_infer(
                     session_id,
                     prompt='',
                     request_output_len=output_seqlen,
                     sequence_start=True,
                     sequence_end=True):
                 continue
-            chatbot.reset_session()
+            _chatbot.reset_session()
     _start = time.perf_counter()
     chatbots = [
         Chatbot(tritonserver_addr=tritonserver_addr,
-                model_name=model_name,
-                session_len=session_len,
                 ignore_eos=True,
                 profile_generation=True) for _ in range(concurrency)
     ]
@@ -90,8 +86,8 @@ def warmup(tritonserver_addr: str,
     print(f'end warmup, elapsed time: {round(_end - _start, 2)} s')
-def read_dataset(tritonserver_addr, tokenizer_path: str, dataset_path: str,
-                 samples: int, test_round: int, session_len: int):
+def read_dataset(tokenizer_path: str, dataset_path: str, samples: int,
+                 test_round: int, session_len: int):
     start = time.perf_counter()
     with open(dataset_path) as f:
         dataset = json.load(f)
@@ -134,24 +130,20 @@ def read_dataset(tritonserver_addr, tokenizer_path: str, dataset_path: str,
 def main(tritonserver_addr: str,
-         model_name: str,
          tokenizer_path: str,
          dataset_path: str,
          concurrency: int = 1,
         session_len: int = 2048,
-         samples: int = 2000,
+         samples: int = 1000,
         test_round: int = 1):
-    warmup(tritonserver_addr, model_name, concurrency, session_len,
-           session_len)
-    req_que = read_dataset(tritonserver_addr, tokenizer_path, dataset_path,
-                           samples, test_round, session_len)
+    warmup(tritonserver_addr, concurrency, session_len - 1)
+    req_que = read_dataset(tokenizer_path, dataset_path, samples, test_round,
+                           session_len)
     res_que = mp.Queue()
     procs = []
     _start = time.perf_counter()
     for i in range(concurrency):
         chatbot = Chatbot(tritonserver_addr=tritonserver_addr,
-                          model_name=model_name,
-                          session_len=session_len,
                           display=False,
                           profile_serving=True,
                           ignore_eos=True)
...
@@ -8,7 +8,6 @@ from typing import List, Tuple
 import fire
-from lmdeploy.model import MODELS
 from lmdeploy.turbomind import Tokenizer, TurboMind
@@ -55,13 +54,11 @@ def sample_requests(
 class Engine:
-    def __init__(self, model_path: str, model_name: str):
+    def __init__(self, model_path: str):
         tokenizer_model_path = osp.join(model_path, 'triton_models',
                                         'tokenizer')
         tokenizer = Tokenizer(tokenizer_model_path)
-        model = MODELS.get(model_name)()
-        stop_words = model.stop_words
-        tm_model = TurboMind(model_path=model_path, stop_words=stop_words)
+        tm_model = TurboMind(model_path=model_path)
         self.tm_model = tm_model
         self.tokenizer = tokenizer
@@ -119,11 +116,10 @@ class Engine:
 def main(dataset: str,
          model_path: str,
-         model_name: str,
          concurrency: int = 1,
         num_prompts: int = 1000):
-    engine = Engine(model_path, model_name)
+    engine = Engine(model_path)
     tokenizer = engine.tokenizer
     requests = sample_requests(dataset, num_prompts, tokenizer)
...
 # Serving a model
+## Serving [LLaMA-2](https://github.com/facebookresearch/llama)
+You can download [llama-2 models from huggingface](https://huggingface.co/meta-llama) and serve them like below:
+<details open>
+<summary><b>7B</b></summary>
+```shell
+python3 -m lmdeploy.serve.turbomind.deploy llama2 /path/to/llama-2-7b-chat-hf
+bash workspace/service_docker_up.sh
+```
+</details>
+<details open>
+<summary><b>13B</b></summary>
+```shell
+python3 -m lmdeploy.serve.turbomind.deploy llama2 /path/to/llama-2-13b-chat-hf --tp 2
+bash workspace/service_docker_up.sh
+```
+</details>
+<details open>
+<summary><b>70B</b></summary>
+```shell
+python3 -m lmdeploy.serve.turbomind.deploy llama2 /path/to/llama-2-70b-chat-hf --tp 8
+bash workspace/service_docker_up.sh
+```
+</details>
 ## Serving [LLaMA](https://github.com/facebookresearch/llama)
 Weights for the LLaMA models can be obtained from by filling out [this form](https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform)
@@ -8,7 +42,7 @@ Weights for the LLaMA models can be obtained from by filling out [this form](htt
 <summary><b>7B</b></summary>
 ```shell
-python3 -m lmdeploy.serve.turbomind.deploy llama-7B /path/to/llama-7b llama \
+python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-7b llama \
   --tokenizer_path /path/to/tokenizer/model
 bash workspace/service_docker_up.sh
 ```
@@ -19,7 +53,7 @@ bash workspace/service_docker_up.sh
 <summary><b>13B</b></summary>
 ```shell
-python3 -m lmdeploy.serve.turbomind.deploy llama-13B /path/to/llama-13b llama \
+python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-13b llama \
   --tokenizer_path /path/to/tokenizer/model --tp 2
 bash workspace/service_docker_up.sh
 ```
@@ -30,7 +64,7 @@ bash workspace/service_docker_up.sh
 <summary><b>30B</b></summary>
 ```shell
-python3 -m lmdeploy.serve.turbomind.deploy llama-32B /path/to/llama-30b llama \
+python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-30b llama \
   --tokenizer_path /path/to/tokenizer/model --tp 4
 bash workspace/service_docker_up.sh
 ```
@@ -41,7 +75,7 @@ bash workspace/service_docker_up.sh
 <summary><b>65B</b></summary>
 ```shell
-python3 -m lmdeploy.serve.turbomind.deploy llama-65B /path/to/llama-65b llama \
+python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-65b llama \
   --tokenizer_path /path/to/tokenizer/model --tp 8
 bash workspace/service_docker_up.sh
 ```
@@ -60,7 +94,7 @@ python3 -m fastchat.model.apply_delta \
   --target-model-path /path/to/vicuna-7b \
   --delta-path lmsys/vicuna-7b-delta-v1.1
-python3 -m lmdeploy.serve.turbomind.deploy vicuna-7B /path/to/vicuna-7b hf
+python3 -m lmdeploy.serve.turbomind.deploy vicuna /path/to/vicuna-7b
 bash workspace/service_docker_up.sh
 ```
@@ -76,7 +110,7 @@ python3 -m fastchat.model.apply_delta \
   --target-model-path /path/to/vicuna-13b \
   --delta-path lmsys/vicuna-13b-delta-v1.1
-python3 -m lmdeploy.serve.turbomind.deploy vicuna-13B /path/to/vicuna-13b hf
+python3 -m lmdeploy.serve.turbomind.deploy vicuna /path/to/vicuna-13b
 bash workspace/service_docker_up.sh
 ```
...
 # Serving a model
+## Serving [LLaMA-2](https://github.com/facebookresearch/llama)
+Download the llama-2 models from [here](https://huggingface.co/meta-llama) and deploy the service with the commands below:
+<details open>
+<summary><b>7B</b></summary>
+```shell
+python3 -m lmdeploy.serve.turbomind.deploy llama2 /path/to/llama-2-7b-chat-hf
+bash workspace/service_docker_up.sh
+```
+</details>
+<details open>
+<summary><b>13B</b></summary>
+```shell
+python3 -m lmdeploy.serve.turbomind.deploy llama2 /path/to/llama-2-13b-chat-hf --tp 2
+bash workspace/service_docker_up.sh
+```
+</details>
+<details open>
+<summary><b>70B</b></summary>
+```shell
+python3 -m lmdeploy.serve.turbomind.deploy llama2 /path/to/llama-2-70b-chat-hf --tp 8
+bash workspace/service_docker_up.sh
+```
+</details>
 ## Serving [LLaMA](https://github.com/facebookresearch/llama)
 Fill out [this form](https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform) to obtain the LLaMA model weights
@@ -8,7 +42,7 @@
 <summary><b>7B</b></summary>
 ```shell
-python3 -m lmdeploy.serve.turbomind.deploy llama-7B /path/to/llama-7b llama \
+python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-7b llama \
   --tokenizer_path /path/to/tokenizer/model
 bash workspace/service_docker_up.sh
 ```
@@ -19,7 +53,7 @@ bash workspace/service_docker_up.sh
 <summary><b>13B</b></summary>
 ```shell
-python3 -m lmdeploy.serve.turbomind.deploy llama-13B /path/to/llama-13b llama \
+python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-13b llama \
   --tokenizer_path /path/to/tokenizer/model --tp 2
 bash workspace/service_docker_up.sh
 ```
@@ -30,7 +64,7 @@ bash workspace/service_docker_up.sh
 <summary><b>30B</b></summary>
 ```shell
-python3 -m lmdeploy.serve.turbomind.deploy llama-32B /path/to/llama-30b llama \
+python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-30b llama \
   --tokenizer_path /path/to/tokenizer/model --tp 4
 bash workspace/service_docker_up.sh
 ```
@@ -41,7 +75,7 @@ bash workspace/service_docker_up.sh
 <summary><b>65B</b></summary>
 ```shell
-python3 -m lmdeploy.serve.turbomind.deploy llama-65B /path/to/llama-65b llama \
+python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-65b llama \
   --tokenizer_path /path/to/tokenizer/model --tp 8
 bash workspace/service_docker_up.sh
 ```
@@ -60,7 +94,7 @@ python3 -m fastchat.model.apply_delta \
   --target-model-path /path/to/vicuna-7b \
   --delta-path lmsys/vicuna-7b-delta-v1.1
-python3 -m lmdeploy.serve.turbomind.deploy vicuna-7B /path/to/vicuna-7b hf
+python3 -m lmdeploy.serve.turbomind.deploy vicuna /path/to/vicuna-7b
 bash workspace/service_docker_up.sh
 ```
@@ -76,7 +110,7 @@ python3 -m fastchat.model.apply_delta \
   --target-model-path /path/to/vicuna-13b \
   --delta-path lmsys/vicuna-13b-delta-v1.1
-python3 -m lmdeploy.serve.turbomind.deploy vicuna-13B /path/to/vicuna-13b hf
+python3 -m lmdeploy.serve.turbomind.deploy vicuna /path/to/vicuna-13b
 bash workspace/service_docker_up.sh
 ```
...
@@ -101,28 +101,26 @@ def cancel_func(
 def run(triton_server_addr: str,
-        model_name: str,
         server_name: str = 'localhost',
         server_port: int = 6006):
     """chat with AI assistant through web ui.
     Args:
         triton_server_addr (str): the communication address of inference server
-        model_name (str): the name of the deployed model
         server_name (str): the ip address of gradio server
         server_port (int): the port of gradio server
     """
     with gr.Blocks(css=CSS, theme=THEME) as demo:
+        log_level = os.environ.get('SERVICE_LOG_LEVEL', 'INFO')
+        _chatbot = Chatbot(triton_server_addr,
+                           log_level=log_level,
+                           display=True)
+        model_name = _chatbot.model_name
         chat_interface = partial(chat_stream, model_name=model_name)
         reset_all = partial(reset_all_func,
                             model_name=model_name,
                             triton_server_addr=triton_server_addr)
-        log_level = os.environ.get('SERVICE_LOG_LEVEL', 'INFO')
-        llama_chatbot = gr.State(
-            Chatbot(triton_server_addr,
-                    model_name,
-                    log_level=log_level,
-                    display=True))
+        llama_chatbot = gr.State(_chatbot)
         state_chatbot = gr.State([])
         with gr.Column(elem_id='container'):
...
@@ -4,11 +4,43 @@ from mmengine import Registry
 MODELS = Registry('model', locations=['lmdeploy.model'])
+@MODELS.register_module(name='llama')
+class BaseModel:
+    """Base model."""
+    def __init__(self):
+        self.session_len = 2048
+        self.top_p = 0.8
+        self.top_k = None
+        self.temperature = 0.8
+        self.repetition_penalty = 1.0
+    @staticmethod
+    def get_prompt(prompt, sequence_start=True):
+        """Return the prompt that is concatenated with other elements in the
+        chat template.
+        Args:
+            prompt (str): user's input prompt
+            sequence_start (bool): indicator for the first round chat of a
+                session sequence
+        Returns:
+            str: the concatenated prompt
+        """
+        return prompt
+    @property
+    def stop_words(self):
+        """Return the stop-words' token ids."""
+        return None
 @MODELS.register_module(name='vicuna')
-class Vicuna:
+class Vicuna(BaseModel):
     """Chat template of vicuna model."""
     def __init__(self):
+        super().__init__()
         self.system = """A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. """ # noqa: E501
         self.user = 'USER'
         self.assistant = 'ASSISTANT'
@@ -29,17 +61,20 @@ class Vicuna:
         else:
             return f'</s>{self.user}: {prompt} {self.assistant}:'
-    @property
-    def stop_words(self):
-        """Return the stop-words' token ids."""
-        return None
 @MODELS.register_module(name='internlm')
-class InternLM:
+class InternLM(BaseModel):
+    def __init__(self):
+        super().__init__()
+@MODELS.register_module(name='internlm-chat-7b')
+class InternLMChat7B(BaseModel):
     """Chat template of InternLM model."""
     def __init__(self):
+        super().__init__()
         self.system = ''
         self.user = '<|User|>'
         self.eoh = '<eoh>'
@@ -70,38 +105,21 @@ class InternLM:
         return [103027, 103028]
-@MODELS.register_module(name='llama')
-class Llama:
-    """Chat template of LLaMA model."""
+@MODELS.register_module(name='internlm-chat-7b-8k')
+class InternLMChat7B8K(InternLMChat7B):
     def __init__(self):
-        pass
+        super(InternLMChat7B8K, self).__init__()
+        self.session_len = 8192
-    def get_prompt(self, prompt, sequence_start=True):
-        """Return the prompt that is concatenated with other elements in the
-        chat template.
-        Args:
-            prompt (str): user's input prompt
-            sequence_start (bool): indicator for the first round chat of a
-                session sequence
-        Returns:
-            str: the concatenated prompt
-        """
-        return prompt
-    @property
-    def stop_words(self):
-        """Return the stop-words' token ids."""
-        return None
 @MODELS.register_module(name='puyu')
-class Puyu:
+class Puyu(BaseModel):
     """Chat template of puyu model.This is only for internal usage in Shanghai
     AI Laboratory."""
     def __init__(self):
+        super().__init__()
         self.system = """meta instruction
 You are an AI assistant whose name is InternLM (书生·浦语).
 - 书生·浦语 is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.
@@ -125,12 +143,20 @@ conversation""" # noqa: E501
         return [45623]
+@MODELS.register_module(name='baichuan-7b')
+class Baichuan7B(BaseModel):
+    def __init__(self):
+        super().__init__()
+        self.repetition_penalty = 1.1
 @MODELS.register_module(name='llama2')
-class Llama2:
+class Llama2(BaseModel):
     """Chat template of LLaMA2 model."""
     def __init__(self):
+        super().__init__()
         B_INST, E_INST = '[INST]', '[/INST]'
         B_SYS, E_SYS = '<<SYS>>\n', '\n<</SYS>>\n\n'
@@ -144,6 +170,7 @@ If a question does not make any sense, or is not factually coherent, explain why
         self.b_sys = B_SYS
         self.e_sys = E_SYS
         self.default_sys_prompt = DEFAULT_SYSTEM_PROMPT
+        self.session_len = 4096
     def get_prompt(self, prompt, sequence_start=True):
         """Return the prompt that is concatenated with other elements in the
@@ -163,19 +190,15 @@ If a question does not make any sense, or is not factually coherent, explain why
         return f'{self.b_inst} {prompt} {self.e_inst} '
-    @property
-    def stop_words(self):
-        """Return the stop-words' token ids."""
-        return None
 def main(model_name: str = 'test'):
     assert model_name in MODELS.module_dict.keys(), \
         f"'{model_name}' is not supported. " \
         f'The supported models are: {MODELS.module_dict.keys()}'
-    model = MODELS.get('vicuna--1')()
+    model = MODELS.get(model_name)()
     prompt = model.get_prompt(prompt='hi')
     print(prompt)
+    print(f'session_len: {model.session_len}')
 if __name__ == '__main__':
...
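Taken together, the hunks above are the factory pattern named in the commit title: each chat template registers itself in the `MODELS` registry under a model name and inherits its defaults (`session_len`, sampling parameters, `stop_words`) from `BaseModel`, so callers only need the name. A minimal sketch of how a caller would use the registry, assuming `lmdeploy.model` is importable; the printed values in the comments are illustrative, taken from the defaults shown above.

```python
from lmdeploy.model import MODELS

# Look the chat template up by its registered name and instantiate it.
model = MODELS.get('internlm-chat-7b')()

print(model.session_len)   # 2048, the BaseModel default
print(model.stop_words)    # e.g. [103027, 103028] for the InternLM chat template
print(model.get_prompt('hi', sequence_start=True))

# The 8k variant only overrides the context length.
print(MODELS.get('internlm-chat-7b-8k')().session_len)   # 8192
```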
@@ -13,7 +13,7 @@ def input_prompt():
     return '\n'.join(iter(input, sentinel))
-def main(tritonserver_addr: str, model_name: str, session_id: int = 1):
+def main(tritonserver_addr: str, session_id: int = 1):
     """An example to communicate with inference server through the command line
     interface.
@@ -24,10 +24,7 @@ def main(tritonserver_addr: str, model_name: str, session_id: int = 1):
         session_id (int): the identical id of a session
     """
     log_level = os.environ.get('SERVICE_LOG_LEVEL', 'WARNING')
-    chatbot = Chatbot(tritonserver_addr,
-                      model_name,
-                      log_level=log_level,
-                      display=True)
+    chatbot = Chatbot(tritonserver_addr, log_level=log_level, display=True)
     nth_round = 1
     while True:
         prompt = input_prompt()
...
@@ -64,15 +64,6 @@ class Chatbot:
         tritonserver_addr (str): communicating address '<ip>:<port>' of
             triton inference server
         model_name (str): name of the to-be-deployed mode
-        session_len (int): the maximum context length of the model
-        top_p (float): If set to float < 1, only the smallest set of most
-            probable tokens with probabilities that add up to top_p or higher
-            are kept for generation.
-        top_k (int): The number of the highest probability vocabulary tokens to
-            keep for top-k-filtering
-        temperature (float): to modulate the next token probability
-        repetition_penalty (float): The parameter for repetition penalty.
-            1.0 means no penalty
         log_level (int): the level of the log
         display (bool): display the generated text on consolo or not
         profile_generation (bool): profile token generation or not
@@ -80,24 +71,18 @@ class Chatbot:
     def __init__(self,
                  tritonserver_addr: str,
-                 model_name: str,
-                 session_len: int = 2048,
-                 top_p: float = 0.8,
-                 top_k: int = None,
-                 temperature: float = 0.8,
-                 repetition_penalty: float = 1.0,
                  ignore_eos: bool = False,
                  log_level: int = logging.INFO,
                  display: bool = False,
                  profile_generation: bool = False,
                  profile_serving: bool = False):
-        assert model_name in MODELS.module_dict.keys(), \
-            f"'{model_name}' is not supported. " \
+        self.tritonserver_addr = tritonserver_addr
+        self.model_name = self._get_model_name()
+        assert self.model_name in MODELS.module_dict.keys(), \
+            f"'{self.model_name}' is not supported. " \
             f'The supported models are: {MODELS.module_dict.keys()}'
-        self.model_name = model_name
         self.model = MODELS.get(self.model_name)()
         self._session = None
-        self.tritonserver_addr = tritonserver_addr
         self.preprocess = Preprocessor(tritonserver_addr)
         self.postprocess = Postprocessor(tritonserver_addr)
         self.bos_id = self._get_bos()
@@ -108,11 +93,11 @@ class Chatbot:
         stop_words = None
         bad_words = np.array([[[self.eos_id], [1]]], dtype=np.int32)
         self.cfg = mmengine.Config(
-            dict(session_len=session_len,
-                 top_p=top_p,
-                 top_k=top_k,
-                 temperature=temperature,
-                 repetition_penalty=repetition_penalty,
+            dict(session_len=self.model.session_len,
+                 top_p=self.model.top_p,
+                 top_k=self.model.top_k,
+                 temperature=self.model.temperature,
+                 repetition_penalty=self.model.repetition_penalty,
                  stop_words=stop_words,
                  bad_words=bad_words))
         self.log_level = log_level
@@ -167,12 +152,16 @@ class Chatbot:
                 request_output_len,
                 sequence_start,
                 sequence_end):
+            yield status, res, tokens
             if status.value < 0:
-                return
-            else:
-                yield status, res, tokens
+                break
+        if status.value == 0:
             self._session.histories = \
                 self._session.histories + self._session.prompt + \
                 self._session.response
+        else:
+            yield status, res, tokens
@@ -208,11 +197,11 @@ class Chatbot:
                 request_output_len=0,
                 sequence_start=False,
                 sequence_end=True):
-            if status != StatusCode.TRITON_STREAM_END:
-                return status
+            if status.value < 0:
+                break
         self.reset_session()
-        return StatusCode.TRITON_STREAM_END
+        return status
     def cancel(self, session_id: int, *args, **kwargs):
         """Cancel the session during generating tokens.
@@ -243,6 +232,7 @@ class Chatbot:
             return StatusCode.TRITON_SESSION_CLOSED
         prev_session = self._session
+        status, res = None, None
         for status, res, _ in self._stream_infer(self._session,
                                                  prompt='',
                                                  request_output_len=0,
@@ -254,7 +244,7 @@ class Chatbot:
         if status == StatusCode.TRITON_STREAM_END:
             logger.info(f'cancel session {session_id} successfully')
             if prev_session.histories:
-                logger.warn(f'TODO: start to recover session {session_id}')
+                logger.warning(f'TODO: start to recover session {session_id}')
         else:
             logger.info(f'cancel session {session_id} failed: {res}')
         return status
@@ -295,7 +285,7 @@ class Chatbot:
                 sequence_start=True,
                 sequence_end=False):
             if status.value < 0:
-                return status
+                break
         self._session.histories = histories
         return status
@@ -314,6 +304,14 @@ class Chatbot:
         """set session."""
         self._session = value
+    def _get_model_name(self):
+        with grpcclient.InferenceServerClient(
+                self.tritonserver_addr) as client:
+            model_config = client.get_model_config(model_name='turbomind',
+                                                   as_json=True)
+        return model_config['config']['parameters']['model_name'][
+            'string_value']
     def _get_bos(self):
         """return bos token id."""
         token_ids, _ = self.preprocess('<BOS>')
@@ -422,16 +420,12 @@ class Chatbot:
                    request_output_len, sequence_start,
                    sequence_end, preseq_length, cancel))
         producer.start()
-        for state, res, tokens in self.stream_consumer(self.postprocess, que,
-                                                       session, input_tokens,
-                                                       preseq_length, cancel,
-                                                       logger, self.display,
-                                                       self.profile_generation,
+        for status, res, n_token in self.stream_consumer(
+                self.postprocess, que, session, input_tokens, preseq_length,
+                cancel, logger, self.display, self.profile_generation,
                 self.eos_id):
-            if state.value < 0:
-                yield state, res, 0
-            else:
-                yield state, res, tokens
+            yield status, res, n_token
         producer.join()
         self._session = que.get()
         curseq_length = self._session.sequence_length
@@ -543,11 +537,13 @@ class Chatbot:
             tuple: status, text, generated token number
         """
         offset = n_input_token + preseq_length
+        status, res, n_token = None, '', 0
         while True:
             result = res_queue.get()
             if result is None:
-                yield (StatusCode.TRITON_STREAM_END, session.response,
-                       session.sequence_length - offset)
+                status = StatusCode.TRITON_STREAM_END
+                res = session.response
+                n_token = session.sequence_length - offset
                 session.status = StatusCode.TRITON_STREAM_END
                 break
            if 'errcode' in result:
@@ -555,7 +551,10 @@ class Chatbot:
                     f"{result['errcode']}, {result['errmsg']}, "
                     f'token {session.sequence_length}')
                 session.sequence_length = preseq_length
-                yield result['errcode'], result['errmsg'], 0
+                session.response = ''
+                status = StatusCode.TRITON_SERVER_ERR
+                res = f"{result['errcode']}, {result['errmsg']}"
+                n_token = 0
                 break
             if cancel:
                 continue
@@ -601,3 +600,4 @@ class Chatbot:
         res_queue.put(session)
         if display:
             print('\n')
+        yield status, res, n_token
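The new `_get_model_name` helper is what lets every front end above (`app.py`, `serve.client`, the profilers) drop their `model_name` arguments: the name is fetched from the Triton server's model configuration instead of being typed by the user. A rough, self-contained sketch of that lookup; the server address is a placeholder and the printed name is illustrative, while the client calls mirror the hunk above.

```python
import tritonclient.grpc as grpcclient

tritonserver_addr = '0.0.0.0:33337'  # placeholder '<ip>:<port>'

# Ask triton inference server for the 'turbomind' model configuration and
# read back the model_name parameter that deploy.py writes into config.pbtxt
# (see the deploy.py hunks further down).
with grpcclient.InferenceServerClient(tritonserver_addr) as client:
    model_config = client.get_model_config(model_name='turbomind',
                                           as_json=True)
model_name = model_config['config']['parameters']['model_name']['string_value']
print(model_name)  # e.g. 'internlm-chat-7b'
```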
@@ -12,9 +12,16 @@ import safetensors
 import torch
 from sentencepiece import SentencePieceProcessor
+from lmdeploy.model import MODELS
 supported_formats = ['llama', 'hf']
+def get_package_root_path():
+    import importlib.resources as pkg_resources
+    return pkg_resources.path('lmdeploy', '')
 def create_workspace(_path: str):
     """Create a workspace.
@@ -164,6 +171,7 @@ def export(model_name: str,
         save_bin(param_data, param_name)
     # export config and save it to {out_dir}/config.ini
+    model = MODELS.get(model_name)()
     vocab_size, bos_id, eos_id = tokenizer_info(tokenizer_path)
     assert _vocab_size >= vocab_size, \
         f'different vocab size {_vocab_size} vs {vocab_size}'
@@ -184,7 +192,7 @@ def export(model_name: str,
         # parameters for turbomind
         max_batch_size=32,
         max_context_token_num=4,
-        session_len=2056,
+        session_len=model.session_len + 8,
         step_length=1,
         cache_max_entry_count=48,
         cache_chunk_size=1,
@@ -226,6 +234,9 @@ def deploy_llama(model_name: str, model_path: str, tokenizer_path: str,
     if osp.exists(tokenizer_path):
         shutil.copy(tokenizer_path,
                     osp.join(triton_models_path, 'tokenizer/tokenizer.model'))
+        with get_package_root_path() as root_path:
+            shutil.copy(osp.join(root_path, 'turbomind/tokenizer.py'),
+                        osp.join(triton_models_path, 'tokenizer'))
     else:
         print(f'tokenizer model {tokenizer_path} does not exist')
         return False
@@ -352,6 +363,9 @@ def deploy_hf(model_name: str, model_path: str, tokenizer_path: str,
             json_path = osp.join(model_path, _file)
             shutil.copy(json_path,
                         osp.join(triton_models_path, 'tokenizer', _file))
+        with get_package_root_path() as root_path:
+            shutil.copy(osp.join(root_path, 'turbomind/tokenizer.py'),
+                        osp.join(triton_models_path, 'tokenizer'))
     else:
         print(f'tokenizer model {tokenizer_path} does not exist')
         exit(-1)
@@ -495,7 +509,7 @@ def pack_model_repository(workspace_path: str):
 def main(model_name: str,
          model_path: str,
-         model_format: str,
+         model_format: str = 'hf',
          tokenizer_path: str = None,
          dst_path: str = './workspace',
          tp: int = 1):
@@ -511,6 +525,9 @@ def main(model_name: str,
         dst_path (str): the destination path that saves outputs
         tp (int): the number of GPUs used for tensor parallelism
     """
+    assert model_name in MODELS.module_dict.keys(), \
+        f"'{model_name}' is not supported. " \
+        f'The supported models are: {MODELS.module_dict.keys()}'
     if model_format not in supported_formats:
         print(f'the model format "{model_format}" is not supported. '
@@ -539,8 +556,11 @@ def main(model_name: str,
     # update `tensor_para_size` in `triton_models/interactive/config.pbtxt`
     with open(osp.join(triton_models_path, 'interactive/config.pbtxt'),
               'a') as f:
-        param = 'parameters {\n key: "tensor_para_size"\n value: {\n ' \
-                'string_value: ' + f'"{tp}"\n' + ' }\n}\n'
+        param = \
+            'parameters {\n key: "tensor_para_size"\n value: {\n ' \
+            'string_value: ' + f'"{tp}"\n' + ' }\n}\n' + \
+            'parameters {\n key: "model_name"\n value: {\n ' \
+            'string_value: ' + f'"{model_name}"\n' + ' }\n}\n'
         f.write(param)
     if not res:
         print(f'deploy model "{model_name}" via turbomind failed')
...
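For readers unfamiliar with the pbtxt syntax, the snippet below simply renders the string that the hunk above appends to `triton_models/interactive/config.pbtxt`. The values (`tp = 2`, `'internlm-chat-7b'`) are illustrative and the exact whitespace is approximate; only the two parameter keys come from the diff.

```python
tp, model_name = 2, 'internlm-chat-7b'  # illustrative values

param = \
    'parameters {\n key: "tensor_para_size"\n value: {\n ' \
    'string_value: ' + f'"{tp}"\n' + ' }\n}\n' + \
    'parameters {\n key: "model_name"\n value: {\n ' \
    'string_value: ' + f'"{model_name}"\n' + ' }\n}\n'
print(param)
# parameters {
#  key: "tensor_para_size"
#  value: {
#  string_value: "2"
#  }
# }
# parameters {
#  key: "model_name"
#  value: {
#  string_value: "internlm-chat-7b"
#  }
# }
```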
@@ -2,88 +2,15 @@
 import json
 import os.path as osp
 from pathlib import Path
-from typing import List
 import numpy as np
 import triton_python_backend_utils as pb_utils
-class Tokenizer:
-    """Tokenize prompts or de-tokenize tokens into texts.
-    Args:
-        model_file (str): the path of the tokenizer model
-    """
-    def __init__(self, model_file: str):
-        model_folder = osp.split(model_file)[0]
-        tokenizer_config_file = osp.join(model_folder, 'tokenizer_config.json')
-        use_hf_model = osp.exists(tokenizer_config_file)
-        self.use_hf_model = use_hf_model
-        if not self.use_hf_model:
-            from sentencepiece import SentencePieceProcessor
-            self.model = SentencePieceProcessor(model_file=model_file)
-            self.vocab_size = self.model.vocab_size()
-            self.start_id = self.model.bos_id()
-            self.end_id = self.model.eos_id()
-        else:
-            from transformers import AutoTokenizer
-            backend_tokenizer_file = osp.join(model_folder, 'tokenizer.json')
-            if not osp.exists(backend_tokenizer_file):
-                print('WARNING: Can not find tokenizer.json. '
-                      'It may take long time to initialize the tokenizer.')
-            self.model = AutoTokenizer.from_pretrained(model_folder,
-                                                       trust_remote_code=True)
-            self.vocab_size = self.model.vocab_size
-            self.start_id = self.model.bos_token_id
-            self.end_id = self.model.eos_token_id
-            # save tokenizer.json to reuse
-            if not osp.exists(backend_tokenizer_file) and \
-                    hasattr(self.model, 'backend_tokenizer'):
-                self.model.backend_tokenizer.save(backend_tokenizer_file)
-    def encode(self, s: str):
-        """Tokenize a prompt.
-        Args:
-            s (str): a prompt
-        Returns:
-            list[int]: token ids
-        """
-        if not self.use_hf_model:
-            add_bos = False
-            add_eos = False
-            if s.find('<BOS>') != -1:
-                s = s.replace('<BOS>', '')
-                add_bos = True
-            if s == '<EOS>':
-                s = ''
-                add_eos = True
-            return self.model.Encode(s, add_bos=add_bos, add_eos=add_eos)
-        else:
-            add_special_tokens = False
-            if s.find('<BOS>') != -1:
-                s = s.replace('<BOS>', '<s>')
-            if s == '<EOS>':
-                s = '</s>'
-            if len(s) == 0:
-                add_special_tokens = True
-            return self.model.encode(s, add_special_tokens=add_special_tokens)
-    def decode(self, t: List[int]):
-        """De-tokenize.
-        Args:
-            t (List[int]): a list of token ids
-        Returns:
-            str: text of decoding tokens
-        """
-        if not self.use_hf_model:
-            return self.model.Decode(t)
-        else:
-            skip_special_tokens = False
-            return self.model.decode(t,
-                                     skip_special_tokens=skip_special_tokens)
+# This tokenizer is `lmdeploy/turbomind/tokenizer.py`. When an LLM is served
+# by triton inference server, it has to be converted first by running
+# `python lmdeploy/serve/turbomind/deploy.py`. Then
+# `lmdeploy/turbomind/tokenizer.py` will be copied to `tokenizer/tokenizer.py`
+from .tokenizer.tokenizer import Tokenizer
 class TritonPythonModel:
...
@@ -2,90 +2,17 @@
 import json
 import os.path as osp
 from pathlib import Path
-from typing import List
 import numpy as np
 import torch
 import triton_python_backend_utils as pb_utils
 from torch.nn.utils.rnn import pad_sequence
-class Tokenizer:
-    """Tokenize prompts or de-tokenize tokens into texts.
-    Args:
-        model_file (str): the path of the tokenizer model
-    """
-    def __init__(self, model_file: str):
-        model_folder = osp.split(model_file)[0]
-        tokenizer_config_file = osp.join(model_folder, 'tokenizer_config.json')
-        use_hf_model = osp.exists(tokenizer_config_file)
-        self.use_hf_model = use_hf_model
-        if not self.use_hf_model:
-            from sentencepiece import SentencePieceProcessor
-            self.model = SentencePieceProcessor(model_file=model_file)
-            self.vocab_size = self.model.vocab_size()
-            self.start_id = self.model.bos_id()
-            self.end_id = self.model.eos_id()
-        else:
-            from transformers import AutoTokenizer
-            backend_tokenizer_file = osp.join(model_folder, 'tokenizer.json')
-            if not osp.exists(backend_tokenizer_file):
-                print('WARNING: Can not find tokenizer.json. '
-                      'It may take long time to initialize the tokenizer.')
-            self.model = AutoTokenizer.from_pretrained(model_folder,
-                                                       trust_remote_code=True)
-            self.vocab_size = self.model.vocab_size
-            self.start_id = self.model.bos_token_id
-            self.end_id = self.model.eos_token_id
-            # save tokenizer.json to reuse
-            if not osp.exists(backend_tokenizer_file) and \
-                    hasattr(self.model, 'backend_tokenizer'):
-                self.model.backend_tokenizer.save(backend_tokenizer_file)
-    def encode(self, s: str):
-        """Tokenize a prompt.
-        Args:
-            s (str): a prompt
-        Returns:
-            list[int]: token ids
-        """
-        if not self.use_hf_model:
-            add_bos = False
-            add_eos = False
-            if s.find('<BOS>') != -1:
-                s = s.replace('<BOS>', '')
-                add_bos = True
-            if s == '<EOS>':
-                s = ''
-                add_eos = True
-            return self.model.Encode(s, add_bos=add_bos, add_eos=add_eos)
-        else:
-            add_special_tokens = False
-            if s.find('<BOS>') != -1:
-                s = s.replace('<BOS>', '<s>')
-            if s == '<EOS>':
-                s = '</s>'
-            if len(s) == 0:
-                add_special_tokens = True
-            return self.model.encode(s, add_special_tokens=add_special_tokens)
-    def decode(self, t: List[int]):
-        """De-tokenize.
-        Args:
-            t (List[int]): a list of token ids
-        Returns:
-            str: text of decoding tokens
-        """
-        if not self.use_hf_model:
-            return self.model.Decode(t)
-        else:
-            skip_special_tokens = False
-            return self.model.decode(t,
-                                     skip_special_tokens=skip_special_tokens)
+# This tokenizer is `lmdeploy/turbomind/tokenizer.py`. When an LLM is served
+# by triton inference server, it has to be converted first by running
+# `python lmdeploy/serve/turbomind/deploy.py`. Then
+# `lmdeploy/turbomind/tokenizer.py` will be copied to `tokenizer/tokenizer.py`
+from .tokenizer.tokenizer import Tokenizer
 class TritonPythonModel:
@@ -131,8 +58,8 @@ class TritonPythonModel:
             osp.join(
                 cur_folder, self.model_config['parameters']['tokenizer_path']
                 ['string_value']))
-        self.start_id = self.tokenizer.start_id
-        self.end_id = self.tokenizer.end_id
+        self.start_id = self.tokenizer.bos_token_id
+        self.end_id = self.tokenizer.eos_token_id
     def execute(self, requests):
         """`execute` must be implemented in every Python model. `execute`
...
@@ -29,29 +29,24 @@ def valid_str(string, coding='utf-8'):
     return ret
-def main(model_name,
-         model_path,
-         session_id: int = 1,
-         repetition_penalty: float = 1.0):
+def main(model_path, session_id: int = 1, repetition_penalty: float = 1.0):
     """An example to perform model inference through the command line
     interface.
     Args:
-        model_name (str): the name of the deployed model
         model_path (str): the path of the deployed model
         session_id (int): the identical id of a session
     """
-    model = MODELS.get(model_name)()
     tokenizer_model_path = osp.join(model_path, 'triton_models', 'tokenizer')
     tokenizer = Tokenizer(tokenizer_model_path)
-    tm_model = tm.TurboMind(model_path,
-                            eos_id=tokenizer.eos_token_id,
-                            stop_words=model.stop_words)
+    tm_model = tm.TurboMind(model_path, eos_id=tokenizer.eos_token_id)
     generator = tm_model.create_instance()
     nth_round = 1
     step = 0
     seed = random.getrandbits(64)
+    model_name = tm_model.model_name
+    model = MODELS.get(model_name)()
     while True:
         prompt = input_prompt()
...
@@ -12,6 +12,7 @@ import torch
 from torch.nn.utils.rnn import pad_sequence
 import lmdeploy
+from lmdeploy.model import MODELS
 # TODO: find another way import _turbomind
 lmdeploy_dir = osp.split(lmdeploy.__file__)[0]
@@ -70,14 +71,12 @@ class TurboMind:
         model_path (str): the path of turbomind's model
         data_type (str): the data type
         eos_id (int): eos token id
-        stop_words (List[int]): token ids of stop-words
     """
     def __init__(self,
                  model_path: str,
                  data_type: str = 'fp16',
-                 eos_id: int = 2,
-                 stop_words: List[int] = None):
+                 eos_id: int = 2):
         self.eos_id = eos_id
         # TODO: support mpi
@@ -101,6 +100,9 @@ class TurboMind:
             self.gpu_count = parser.getint(section_name,
                                            'tensor_para_size')
             self.session_len = parser.getint(section_name, 'session_len')
+            self.model_name = parser.get(section_name, 'model_name')
+            model = MODELS.get(self.model_name)()
+            self.stop_words = _stop_words(model.stop_words)
         # params
         self.node_id = node_id
@@ -129,8 +131,6 @@ class TurboMind:
         for t in threads:
             t.join()
-        self.stop_words = _stop_words(stop_words)
     def create_instance(self, cuda_stream_id=0):
         """Create a turbomind instance.
...
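This is the offline counterpart of the gRPC lookup in `chatbot.py`: `deploy.py` records `model_name` in the converted workspace's config, and `TurboMind` reads it back with `configparser`, then resolves the chat template via `MODELS.get(model_name)()` to obtain the stop words. A toy sketch of that read path; the section name, keys other than `model_name`/`session_len`, and all values below are made up for illustration.

```python
from configparser import ConfigParser

# Stand-in for the config.ini that deploy.py writes into ./workspace;
# the section name and numbers are illustrative only.
parser = ConfigParser()
parser.read_string("""
[llama]
tensor_para_size = 1
session_len = 2056
model_name = internlm-chat-7b
""")

section_name = parser.sections()[0]
session_len = parser.getint(section_name, 'session_len')
model_name = parser.get(section_name, 'model_name')
print(section_name, session_len, model_name)
```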