"docs/developer_guide/setup_github_runner.md" did not exist on "33f0de337d978b37c63b98575b4962c6e6479e8c"
Commit fe851fbc authored by zhouxiang's avatar zhouxiang
Browse files

0.2.6版本新增文件补充

parent e2d98ddc
# Request Distribution Server
The request distribution service connects multiple api_server instances in parallel. Users only need to access the proxy URL to reach the different api_server services indirectly. The proxy dispatches requests internally and performs load balancing.
## Launch
Start the proxy service:
```shell
python3 -m lmdeploy.serve.proxy.proxy --server_name {server_name} --server_port {server_port} --strategy "min_expected_latency"
```
After a successful start, the script also prints the URL of the proxy service. Open this URL in a browser to access the Swagger UI.
## API
Through the Swagger UI we can see several APIs. Those related to api_server node management are:
- /nodes/status
- /nodes/add
- /nodes/remove
They list all api_server service nodes, add a node, and remove a node, respectively, as in the sketch below.
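The following is a minimal sketch of managing nodes through these endpoints with Python's `requests` library. The proxy address and the request field names (`url`, `node_url`) are assumptions for illustration; check the Swagger UI for the exact schemas.
```python
import requests

PROXY = 'http://0.0.0.0:8000'  # assumed proxy address

# register an api_server node behind the proxy (field name is an assumption)
requests.post(f'{PROXY}/nodes/add', json={'url': 'http://0.0.0.0:23333'})

# list all registered nodes
print(requests.get(f'{PROXY}/nodes/status').json())

# remove a node (parameter name is an assumption)
requests.post(f'{PROXY}/nodes/remove', params={'node_url': 'http://0.0.0.0:23333'})
```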
The APIs related to usage are:
- /v1/models
- /v1/chat/completions
- /v1/completions
These APIs are used in the same way as on an api_server; a usage sketch follows below.
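A minimal sketch using `requests`, assuming the proxy listens on http://0.0.0.0:8000, at least one api_server node is registered, and the endpoints follow the OpenAI-style schema used by api_server:
```python
import requests

PROXY = 'http://0.0.0.0:8000'  # assumed proxy address

# the proxy exposes the same OpenAI-style endpoints as api_server
model = requests.get(f'{PROXY}/v1/models').json()['data'][0]['id']
resp = requests.post(f'{PROXY}/v1/chat/completions',
                     json={
                         'model': model,
                         'messages': [{'role': 'user', 'content': 'Hello'}],
                         'temperature': 0.8,
                     })
print(resp.json())
```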
## Dispatch Strategies
The proxy service currently supports the following dispatch strategies:
- random: weighted random dispatch based on the request-handling capacity provided for each api_server node. The larger a node's throughput, the more likely it is to be picked. Nodes that did not report a throughput are treated as having the average throughput of the other nodes.
- min_expected_latency: based on each node's outstanding requests and its throughput, compute the expected time needed to finish the response and dispatch to the node with the shortest time (see the sketch below). Nodes without a reported throughput are handled as above.
- min_observed_latency: based on the average time each node took to finish a number of recent requests, dispatch to the node with the shortest time.
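The sketch below illustrates the idea behind min_expected_latency only; it is not the proxy's actual implementation.
```python
def pick_node(nodes):
    """nodes: list of dicts like {'url': str, 'pending': int, 'throughput': float | None}."""
    known = [n['throughput'] for n in nodes if n.get('throughput')]
    avg = sum(known) / len(known) if known else 1.0

    def expected_latency(node):
        # nodes without a reported throughput fall back to the average
        speed = node.get('throughput') or avg
        return node['pending'] / speed

    return min(nodes, key=expected_latency)


nodes = [
    {'url': 'http://0.0.0.0:23333', 'pending': 4, 'throughput': 2.0},
    {'url': 'http://0.0.0.0:23334', 'pending': 1, 'throughput': None},
]
print(pick_node(nodes)['url'])  # http://0.0.0.0:23334
```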
## LMDeploy-QoS Introduction and Usage
### Background
With the rise of LLMs and AGI, many inference frameworks have emerged to serve scalable, high-performance online workloads for language models. These workloads usually involve multiple user groups and change rapidly over short periods of time. Many inference frameworks struggle to meet the demands of such multi-tenant traffic patterns and do not properly regulate user behavior, so we believe multi-user load balancing deserves first-class consideration in an LLM inference framework.
### User Categorization for Multi-tenant Handling
LMDeploy-QoS works together with LMDeploy and provides a set of multi-tenant features. It requires users to tag their inference requests with a proper user identifier (user_id in the configuration or codebase). It uses a dictionary-based configuration as the multi-tenant policy: users are mapped to different "user groups" and assigned a usage quota. The multi-tenant policy reads this configuration and schedules user inference requests according to the priority of their user group and the gap between the predefined quota and the real-time allocation ratio. After thorough testing, our LMDeploy-QoS module greatly improves the serving reliability of LLMs and the GPU resource utilization of inference workloads.
LMDeploy divides users into 4 groups:
- Platinum
- Gold
- Silver
- Bronze
Based on our experience in serving LLMs, the following 4 types of users can be mapped to these groups:
- Platinum: VIP or administrator users, for example service developers or demo presenters who need uninterrupted access. Their workload frequency is low, and their resource demand on inference is small.
- Gold: premium users with signed service contracts who need measurable, reliable service. For example, company A signs a contract with the LLM service provider for X requests per second at Z% availability for its employees, paying Y million dollars per year.
- Silver: the vast majority of users. Most trial or monthly-subscription users fall into this category. They need relatively little service, but their user experience is still important for the reputation of the LLM service.
- Bronze: heavy users who pay very little to the LLM provider.
The categorization above is meant as guidance rather than a recommendation for all LMDeploy users, since it does not necessarily suit every LLM business. Administrators can collect statistics on daily user workloads and decide on their own how to categorize users.
Next, let us discuss how LMDeploy dispatches requests based on these categories.
### Multi-tenant Strategies
#### Strategy 1: prioritized scheduling between user groups
We introduce the concept of a "user group". The module user defines the mapping from users to user groups (which can be understood as a uid-to-group mapping). The recommended 4 user groups are:
- Platinum
- Gold
- Silver
- Bronze
The priority order between the four groups is strictly Platinum > Gold > Silver > Bronze. When the system is busy, requests from higher-ranked groups are executed first.
The diagram below shows how prioritized scheduling works. You can see that the Platinum request has been re-prioritized and moved to the head of the queue.
![](https://github.com/InternLM/lmdeploy/assets/52888924/9d63f081-7168-4c74-8456-24f0a4b41649)
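An illustrative sketch of this idea (not LMDeploy's actual scheduler): a priority queue where a smaller group rank always wins and arrival order breaks ties.
```python
import heapq
from itertools import count

PRIORITY = {'Platinum': 0, 'Gold': 1, 'Silver': 2, 'Bronze': 3}
arrival = count()
queue = []


def submit(group, request):
    heapq.heappush(queue, (PRIORITY[group], next(arrival), request))


submit('Silver', 'req-1')
submit('Bronze', 'req-2')
submit('Platinum', 'req-3')  # jumps ahead of the earlier requests

while queue:
    _, _, request = heapq.heappop(queue)
    print(request)  # req-3, req-1, req-2
```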
#### Strategy 2: proportional sharing and soft isolation within a user group
This strategy applies only within a user group. We introduce a per-group user quota table that defines each user's "ideal share" of 100% of the GPU resources. Each "user" appears in the list as a user_id, and a user can belong to only one group. Users below their configured quota get higher priority for released resources than users above it, until the usage of both sides converges toward the original quota ratio. The scheduler only considers users present in the request queue and ignores configured users that are not in the queue.
The diagram below shows a typical example of this strategy.
![](https://github.com/InternLM/lmdeploy/assets/52888924/3e1d7135-6b11-4998-89a1-b72af6c962c3)
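An illustrative sketch of the intra-group rule (not LMDeploy's implementation): among the users currently waiting, serve the one whose measured share is furthest below its configured quota.
```python
def pick_user(queued_users, usage_pct, quota_pct):
    """queued_users: ids currently in the queue; usage_pct/quota_pct: percent per user."""
    def deficit(uid):
        # negative means the user is below its quota; the most under-served wins
        return usage_pct.get(uid, 0.0) - quota_pct.get(uid, 0.0)

    return min(queued_users, key=deficit)


quota = {'user_id1': 50, 'user_id2': 50}
usage = {'user_id1': 70.0, 'user_id2': 20.0}
print(pick_user(['user_id1', 'user_id2'], usage, quota))  # user_id2
```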
#### Strategy 3: a hybrid mechanism
This means enabling both the inter-group priority and the intra-group sharing/isolation in one system. The execution order is: apply the inter-group priority first, then apply sharing/isolation within each group; the timing diagram is omitted here. Note that the inter-group priority can completely override the intra-group decisions. For example, while two users inside a low-priority group are being scheduled against each other, an incoming high-priority request overrides all of the low-priority allocation logic and is executed first.
![](https://github.com/InternLM/lmdeploy/assets/52888924/e335f976-ff15-48db-b1ff-abf1c3327d6e)
Note that there may be other ways to build a hybrid mechanism; this document only describes one approach that works in our scenario. Any other hybrid approach has to deal with the fact that priority and proportional sharing are inherently conflicting policies, so there is no simple way to combine them within a single dimension.
### QoS Configuration Template
The configuration file is specified by the launch argument `--qos-config-path` and is loaded by the program at startup.
The configuration is placed together with the lmdeploy launch scripts and other files. It contains:
1. The QoS enabling switch: the subsequent QoS and user-related settings only take effect when it is set to true; when it is set to false, the rest of the configuration is ignored.
2. user_groups: a list that defines the priorities between groups.
3. user_group_map: a mapping that contains the priority of each user group, the user ids inside each group, and the quota assigned to each user within a group.
The configuration template is as follows; a small validation sketch is given after it:
```json
{
    "enable_user_qos": true,
    "user_groups": [
        "Platinum",
        "Gold",
        "Silver",
        "Bronze"
    ],
    "user_group_map": {
        "Platinum": [
            {
                "id": "user_id0",
                "quota_pct": 100
            },
            {
                "id": "default",
                "quota_pct": 0
            }
        ],
        "Gold": [
            {
                "id": "user_id1",
                "quota_pct": 50
            },
            {
                "id": "user_id2",
                "quota_pct": 50
            }
        ],
        "Silver": [
            {
                "id": "user_id3",
                "quota_pct": 5
            },
            {
                "id": "default",
                "quota_pct": 95
            }
        ],
        "Bronze": [
            {
                "id": "user_id4",
                "quota_pct": 30
            },
            {
                "id": "user_id5",
                "quota_pct": 30
            },
            {
                "id": "user_id6",
                "quota_pct": 40
            },
            {
                "id": "default",
                "quota_pct": 0
            }
        ]
    }
}
```
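A small sanity check for a hand-edited configuration, assuming the JSON layout shown above (in the template every group's quota_pct sums to 100; whether this is strictly required should be verified against the qos_engine code):
```python
import json

with open('qos_config.json') as f:  # example path
    cfg = json.load(f)

assert isinstance(cfg['enable_user_qos'], bool)
for group in cfg['user_groups']:
    entries = cfg['user_group_map'][group]
    total = sum(entry['quota_pct'] for entry in entries)
    print(f'{group}: {len(entries)} users, quota sum {total}')
```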
### How to Perform Inference with LMDeploy-QoS Awareness
The following examples show how to send inference requests that are aware of the multi-tenant policy. The QoS-related parameters are carried in the HTTP body as shown below:
/v1/chat/interactive_qos
```bash
curl -X POST http://localhost/v1/chat/interactive_qos \
-H "Content-Type: application/json" \
-d '{
"prompt": "Hello,Hello",
"session_id": -1,
"interactive_mode": false,
"stream": false,
"stop": false,
"request_output_len": 512,
"top_p": 0.8,
"top_k": 40,
"temperature": 0.8,
"repetition_penalty": 1,
"ignore_eos": false,
"user_id": "user_id0"
}'
```
/v1/chat/completions_qos
```bash
curl -X POST http://localhost/v1/chat/completions_qos \
-H "Content-Type: application/json" \
-d '{
"model": "internlm-chat-7b",
"messages": "Hello,Hello",
"temperature": 0.7,
"top_p": 1,
"n": 1,
"max_tokens": 512,
"stop": false,
"stream": false,
"presence_penalty": 0,
"frequency_penalty": 0,
"repetition_penalty": 1,
"session_id": -1,
"ignore_eos": false,
"user_id": "user_id0"
}'
```
/v1/completions_qos
```bash
curl -X POST http://localhost/v1/completions_qos \
-H "Content-Type: application/json" \
-d '{
"model": "internlm-chat-7b",
"prompt": "Hello,Hello",
"suffix": "string",
"temperature": 0.7,
"n": 1,
"max_tokens": 16,
"stop": "string",
"stream": false,
"top_p": 1,
"repetition_penalty": 1,
"session_id": -1,
"ignore_eos": false,
"user_id": "user_id0"
}'
```
### Modifying the Configuration File
The configuration template is located at `lmdeploy/serve/qos_engine/qos_config.json.template`. Add the users you need and set the correct priorities and quota values according to your actual requirements, as sketched below.
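For example, a minimal sketch of customizing the template programmatically (the added user id is hypothetical; the output path is then passed to `--qos-config-path`):
```python
import json

with open('lmdeploy/serve/qos_engine/qos_config.json.template') as f:
    cfg = json.load(f)

# add a hypothetical user to the Gold group and rebalance its quotas
cfg['user_group_map']['Gold'] = [
    {'id': 'user_id1', 'quota_pct': 40},
    {'id': 'user_id2', 'quota_pct': 40},
    {'id': 'new_user', 'quota_pct': 20},
]

with open('qos_config.json', 'w') as f:
    json.dump(cfg, f, indent=4)
```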
### Passing the Configuration
When starting the api_server, pass the configuration file path via `--qos-config-path`, for example:
```bash
CUDA_VISIBLE_DEVICES=0 lmdeploy serve api_server internlm/internlm-chat-7b --server-port 8000 --qos-config-path lmdeploy/serve/qos_engine/qos_config.json.template
```
### Contributors
[Eric](https://github.com/rhinouser0), [sallyjunjun](https://github.com/sallyjunjun), [sfireworks](https://github.com/sfireworks), [Dofgal](https://github.com/Dofgal), [shadow](https://github.com/awslshadowstar)
# Supported Models
## Models supported by TurboMind
| Model | Model size | FP16/BF16 | KV INT8 | W4A16 |
| :----------------: | :------: | :-------: | :-----: | :---: |
| Llama | 7B - 65B | Yes | Yes | Yes |
| Llama2 | 7B - 70B | Yes | Yes | Yes |
| InternLM | 7B - 20B | Yes | Yes | Yes |
| InternLM2 | 7B - 20B | Yes | - | Yes |
| InternLM-XComposer | 7B | Yes | Yes | Yes |
| QWen | 7B - 72B | Yes | Yes | Yes |
| QWen-VL | 7B | Yes | Yes | Yes |
| Baichuan | 7B | Yes | Yes | Yes |
| Baichuan2 | 7B | Yes | Yes | Yes |
| Code Llama | 7B - 34B | Yes | No | No |
| YI | 6B - 34B | Yes | No | No |
## Models supported by PyTorch
| Model | Model size | FP16/BF16 | KV INT8 | W8A8 |
| :----------: | :-------: | :-------: | :-----: | :--: |
| Llama | 7B - 65B | Yes | No | Yes |
| Llama2 | 7B - 70B | Yes | No | Yes |
| InternLM | 7B - 20B | Yes | No | Yes |
| InternLM2 | 7B - 20B | Yes | No | - |
| Baichuan2 | 7B - 13B | Yes | No | Yes |
| ChatGLM2 | 6B | Yes | No | No |
| Falcon | 7B - 180B | Yes | No | No |
| YI | 6B - 34B | Yes | No | No |
| Mistral | 7B | Yes | No | No |
| Mixtral | 8x7B | Yes | No | No |
| QWen1.5 | 7B - 72B | Yes | No | No |
| DeepSeek-MoE | 16B | Yes | No | No |
| Gemma | 2B - 7B | Yes | No | No |
# How to generate start_ids.csv
```bash
# update `model_file` path and `encode_line` content according to the actual situation
python3 tokenizer.py --model_file /workdir/llama2_13b_chat/tokenizer.model --encode_line 'LMDeploy is a toolkit for compressing, deploying, and serving LLMs.'
# refer to tokenizer.py for more usage scenarios
```
1,365,5773,1022,2376,338,263,5780,7354,363,27122,292,29892,7246,292,29892,322,16330,365,26369,29879,29889
# Vision-Language Web Demo
A chatbot demo with image input.
## Supported Models
- [InternLM/InternLM-XComposer](https://github.com/InternLM/InternLM-XComposer/tree/main)
- [Qwen/Qwen-VL-Chat](https://huggingface.co/Qwen/Qwen-VL-Chat)
## Quick Start
### internlm/internlm-xcomposer-7b
- extract the llm model from the huggingface model
```shell
python extract_xcomposer_llm.py
# the llm part will be saved to the internlm_model folder
```
- launch the demo
```shell
python app.py --model-name internlm-xcomposer-7b --llm-ckpt internlm_model
```
### Qwen-VL-Chat
- launch the demo
```shell
python app.py --model-name qwen-vl-chat --hf-ckpt Qwen/Qwen-VL-Chat
```
## Limitations
- This demo uses the code from the models' own repos to extract image features, which might not be very efficient.
- This demo only covers the chat function. If you want to use the localization ability of Qwen-VL-Chat or the article generation function of InternLM-XComposer, you need to implement the corresponding pre/post processing yourself. The difference compared to chat lies in how the prompts are built and how the model output is used.
import argparse
import os
import random
from contextlib import contextmanager
from dataclasses import dataclass, field
from itertools import count
from pathlib import Path
from threading import Lock
from typing import List, Tuple
import gradio as gr
from packaging.version import Version, parse
from qwen_model import QwenVLChat
from xcomposer_model import InternLMXComposer
from lmdeploy.serve.gradio.constants import CSS, THEME, disable_btn, enable_btn
from lmdeploy.turbomind import TurboMind
from lmdeploy.turbomind.chat import valid_str
BATCH_SIZE = 32
DEFAULT_MODEL_NAME = 'internlm-xcomposer-7b'
DEFAULT_HF_CKPT = 'internlm/internlm-xcomposer-7b'
# should use extract_xcomposer_llm.py to extract llm
# when use internlm-xcomposer-7b
DEFAULT_LLM_CKPT = None
SUPPORTED_MODELS = {
'internlm-xcomposer-7b': InternLMXComposer,
'qwen-vl-chat': QwenVLChat
}
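# gradio renamed the queue concurrency argument in 4.x; pick the kwarg that
# matches the installed version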
if parse(gr.__version__) >= Version('4.0.0'):
que_kwargs = {'default_concurrency_limit': BATCH_SIZE}
else:
que_kwargs = {'concurrency_count': BATCH_SIZE}
@dataclass
class Session:
_lock = Lock()
_count = count()
_session_id: int = None
_message: List[Tuple[str, str]] = field(default_factory=list)
_step: int = 0
def __init__(self):
with Session._lock:
self._session_id = next(Session._count)
self._message = []
self._step = 0
@property
def session_id(self):
return self._session_id
@property
def message(self):
return self._message
@property
def step(self):
return self._step
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument('--model-name',
type=str,
default=DEFAULT_MODEL_NAME,
help='Model name, default to %(default)s')
parser.add_argument(
'--hf-ckpt',
type=str,
default=DEFAULT_HF_CKPT,
help='hf checkpoint name or path, default to %(default)s')
parser.add_argument(
'--llm-ckpt',
type=str,
default=DEFAULT_LLM_CKPT,
help='LLM checkpoint name or path, default to %(default)s')
parser.add_argument('--server-port',
type=int,
default=9006,
help='Server port, default %(default)s')
parser.add_argument('--server-name',
type=str,
default='0.0.0.0',
help='Server name, default %(default)s')
args = parser.parse_args()
return args
@contextmanager
def get_stop_words():
from lmdeploy.tokenizer import Tokenizer
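    # temporarily monkey-patch Tokenizer.indexes_containing_token so that stop
    # words are matched by their exact encoded ids; the original method is
    # restored after the `yield`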
old_func = Tokenizer.indexes_containing_token
def new_func(self, token):
indexes = self.encode(token, add_bos=False)
return indexes
Tokenizer.indexes_containing_token = new_func
yield
Tokenizer.indexes_containing_token = old_func
def load_preprocessor_model(args):
"""Load preprocessor and llm inference engine."""
assert args.model_name in SUPPORTED_MODELS
llm_ckpt = args.hf_ckpt if args.llm_ckpt is None else args.llm_ckpt
preprocessor = SUPPORTED_MODELS[args.model_name](args.hf_ckpt)
with get_stop_words():
model = TurboMind.from_pretrained(llm_ckpt, model_name=args.model_name)
return preprocessor, model
def launch_demo(args, preprocessor, model):
def add_image(chatbot, session, file):
"""Append image to query."""
chatbot = chatbot + [((file.name, ), None)]
history = session._message
# [([user, url, url], assistant), ...]
if len(history) == 0 or history[-1][-1] is not None:
history.append([[file.name], None])
else:
history[-1][0].append(file.name)
return chatbot, session
def add_text(chatbot, session, text):
"""User query."""
chatbot = chatbot + [(text, None)]
history = session._message
if len(history) == 0 or history[-1][-1] is not None:
history.append([text, None])
else:
history[-1][0].insert(0, text)
return chatbot, session, disable_btn, enable_btn
def chat(
chatbot,
session,
request_output_len=512,
):
"""Chat with AI assistant."""
generator = model.create_instance()
history = session._message
sequence_start = len(history) == 1
seed = random.getrandbits(64) if sequence_start else None
input_ids, features, ranges = preprocessor.prepare_query(
history[-1][0], sequence_start)
if len(input_ids
) + session.step + request_output_len > model.model.session_len:
gr.Warning('WARNING: exceed session max length.'
' Please restart the session by reset button.')
yield chatbot, session, enable_btn, disable_btn, enable_btn
else:
response_size = 0
step = session.step
for outputs in generator.stream_infer(
session_id=session.session_id,
input_ids=input_ids,
input_embeddings=features,
input_embedding_ranges=ranges,
request_output_len=request_output_len,
stream_output=True,
sequence_start=sequence_start,
random_seed=seed,
step=step):
res, tokens = outputs[0]
# decode res
response = model.tokenizer.decode(res.tolist(),
offset=response_size)
if response.endswith('�'):
continue
response = valid_str(response)
response_size = tokens
if chatbot[-1][1] is None:
chatbot[-1][1] = ''
history[-1][1] = ''
chatbot[-1][1] += response
history[-1][1] += response
session._step = step + len(input_ids) + tokens
yield chatbot, session, disable_btn, enable_btn, disable_btn
yield chatbot, session, enable_btn, disable_btn, enable_btn
def stop(session):
"""Stop the session."""
generator = model.create_instance()
for _ in generator.stream_infer(session_id=session.session_id,
input_ids=[0],
request_output_len=0,
sequence_start=False,
sequence_end=False,
stop=True):
pass
def cancel(chatbot, session):
"""Stop the session and keey chat history."""
stop(session)
return chatbot, session, disable_btn, enable_btn, enable_btn
def reset(session):
"""Reset a new session."""
stop(session)
session._step = 0
session._message = []
return [], session, enable_btn
with gr.Blocks(css=CSS, theme=THEME) as demo:
with gr.Column(elem_id='container'):
gr.Markdown('## LMDeploy VL Playground')
chatbot = gr.Chatbot(elem_id='chatbot', label=model.model_name)
query = gr.Textbox(placeholder='Please input the instruction',
label='Instruction')
session = gr.State()
with gr.Row():
addimg_btn = gr.UploadButton('Upload Image',
file_types=['image'])
cancel_btn = gr.Button(value='Cancel', interactive=False)
reset_btn = gr.Button(value='Reset')
addimg_btn.upload(add_image, [chatbot, session, addimg_btn],
[chatbot, session],
show_progress=True,
queue=True)
send_event = query.submit(
add_text, [chatbot, session, query], [chatbot, session]).then(
chat, [chatbot, session],
[chatbot, session, query, cancel_btn, reset_btn])
query.submit(lambda: gr.update(value=''), None, [query])
cancel_btn.click(cancel, [chatbot, session],
[chatbot, session, cancel_btn, reset_btn, query],
cancels=[send_event])
reset_btn.click(reset, [session], [chatbot, session, query],
cancels=[send_event])
demo.load(lambda: Session(), inputs=None, outputs=[session])
demo.queue(api_open=True, **que_kwargs, max_size=100)
demo.launch(
share=True,
server_port=args.server_port,
server_name=args.server_name,
)
def main():
args = parse_args()
cur_folder = Path(__file__).parent.as_posix()
if cur_folder != os.getcwd():
os.chdir(cur_folder)
print(f'change working dir to {cur_folder}')
preprocessor, model = load_preprocessor_model(args)
launch_demo(args, preprocessor, model)
if __name__ == '__main__':
main()
import os
from pathlib import Path
import torch
from transformers import AutoModel, AutoTokenizer
from xcomposer_model import InternLMXComposerTemplate # noqa
model = AutoModel.from_pretrained('internlm/internlm-xcomposer-7b',
trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer-7b',
trust_remote_code=True)
internlm_model = model.internlm_model
lora_layers = [
'self_attn.q_proj', 'self_attn.v_proj', 'mlp.down_proj', 'mlp.up_proj'
]
def get_attr(m, key):
keys = key.split('.')
for key in keys:
m = getattr(m, key)
return m
# merge lora
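# fold each low-rank update back into the base weight (W += B @ A) so the merged
# model can be saved and served without the LoRA modules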
for i in range(len(internlm_model.model.layers)):
layer = internlm_model.model.layers[i]
for key in lora_layers:
lora_linear = get_attr(layer, key)
lora_b = lora_linear.lora_B
lora_a = lora_linear.lora_A
w_ba = torch.matmul(lora_b.weight, lora_a.weight)
lora_linear.weight.data += w_ba.data
# save model
cur_folder = Path(__file__).parent
dst_path = os.path.join(cur_folder, 'internlm_model')
internlm_model.save_pretrained(dst_path)
tokenizer.save_pretrained(dst_path)
import os
from glob import glob
import numpy as np
import torch
from accelerate import init_empty_weights
from huggingface_hub import snapshot_download
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from lmdeploy.model import MODELS, Qwen7BChat
@MODELS.register_module(name='qwen-vl-chat')
class QwenVLChatTemplate(Qwen7BChat):
"""Qwen vl chat template."""
def __init__(self,
session_len=8192,
top_p=0.3,
top_k=None,
temperature=1.0,
**kwargs):
super().__init__(**kwargs)
self.session_len = session_len
self.top_p = top_p
self.top_k = top_k
self.temperature = temperature
def _concat_image_info(self, prompt):
"""Append image placeholder."""
if isinstance(prompt, str):
return prompt
prompt, nimg = prompt
res = ''
for i in range(nimg):
res += f'Picture {str(i)}:<img>placeholder</img>\n'
prompt = res + prompt
return prompt
def get_prompt(self, prompt, sequence_start=True):
"""Apply chat template to prompt."""
prompt = self._concat_image_info(prompt)
return super().get_prompt(prompt, sequence_start)
def messages2prompt(self, messages, sequence_start=True):
"""Apply chat template to history."""
if isinstance(messages, str) or isinstance(messages[0], str):
return self.get_prompt(messages, sequence_start)
box_map = dict(user=self.user,
assistant=self.assistant,
system=self.system)
eox_map = dict(user=self.eoh,
assistant=self.eoa + self.separator,
system=self.eosys)
ret = ''
if self.meta_instruction is not None:
if len(messages) and messages[0]['role'] != 'system':
ret += f'{self.system}{self.meta_instruction}{self.eosys}'
for message in messages:
role = message['role']
content = message['content']
if role == 'user' and not isinstance(content, str):
content = [content[0]['text'], len(content) - 1]
content = self._concat_image_info(content)
ret += f'{box_map[role]}{content}{eox_map[role]}'
ret += f'{self.assistant}'
return ret
class QwenVLChat:
"""Qwen vl preprocessor to prepare the inputs for a model."""
def __init__(self, pretrained_model_name_or_path, **kwargs):
self.pretrained_model_name_or_path = pretrained_model_name_or_path
self.decorator = QwenVLChatTemplate(**kwargs)
self._load_model()
def _load_model(self):
path = self.pretrained_model_name_or_path
if not os.path.exists(path):
path = snapshot_download(path)
self.tokenizer = AutoTokenizer.from_pretrained(path,
trust_remote_code=True)
with init_empty_weights():
config = AutoConfig.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_config(config,
trust_remote_code=True)
del model.lm_head
for key in ['wte', 'h', 'ln_f']:
setattr(model.transformer, key, None)
model.to_empty(device='cpu')
named_parameters = set()
for key, _ in model.named_parameters():
named_parameters.add(key)
# TODO: load bin according to index.json
bins = glob(os.path.join(path, '*.bin'))
for bin in bins:
dt = torch.load(bin, map_location='cpu')
missed, _ = model.load_state_dict(dt, strict=False)
named_parameters.difference_update(set(missed))
assert len(
named_parameters) == 0, f'missing keys: {named_parameters}'
self.model = model.to('cuda').eval()
@torch.no_grad()
def encode_img(self, paths):
"""Extract image features."""
if len(paths) == 0:
return None
features = []
# with torch.cuda.amp.autocast(dtype=torch.float16):
features = self.model.transformer.visual.encode(paths).float()
features = [x.cpu().numpy() for x in features]
return features
def _to_inputs(self, decorate_text, image_paths, sequence_start):
features = self.encode_img(image_paths)
input_ids = self.tokenizer.encode(decorate_text)
ranges = None
if features is not None:
input_ids_arr = np.array(input_ids)
begins = np.where(
input_ids_arr == self.tokenizer.img_start_id)[0] + 1
ends = np.where(input_ids_arr == self.tokenizer.img_end_id)[0]
ranges = np.stack([begins, ends], axis=1)
assert len(features) == len(ranges)
return input_ids, features, ranges
def prepare_query(self, query, sequence_start=True):
"""Convert query to input_ids, features and the ranges of features to
input_ids."""
image_paths = []
if not isinstance(query, str):
query, image_paths = query[0], query[1:]
decorate_text = self.decorator.get_prompt((query, len(image_paths)),
sequence_start)
return self._to_inputs(decorate_text, image_paths, sequence_start)
def prepare_message(self, messages):
"""Convert messages to input_ids, features and the ranges of features
to input_ids."""
decorate_text = self.decorator.messages2prompt(messages, True)
image_paths = []
for msg in messages:
if msg['role'] == 'user':
content = msg['content']
if isinstance(content, str):
continue
for item in content:
if item['type'] == 'image_url':
url = item['image_url']['url']
image_paths.append(url)
return self._to_inputs(decorate_text, image_paths, True)
import os
# from safetensors.torch import load_file
from collections.abc import Sequence
from glob import glob
import numpy as np
import torch
from accelerate import init_empty_weights
from huggingface_hub import snapshot_download
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from lmdeploy.model import MODELS, BaseChatTemplate
meta_instruction = """meta instruction
You are an AI assistant whose name is 浦语.
- 浦语 is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.
- 浦语 can understand and communicate fluently in the language chosen by the user such as English and 中文.
conversation
""" # noqa
@MODELS.register_module(name='internlm-xcomposer-7b')
class InternLMXComposerTemplate(BaseChatTemplate):
"""Internlm xcomposer chat template."""
def __init__(self,
meta_instruction=meta_instruction,
user=' <|User|>: ',
assistant=' <|Bot|>: ',
eoh='<TOKENS_UNUSED_0>',
eoa='<TOKENS_UNUSED_1>',
stop_words=['<TOKENS_UNUSED_0>', '<TOKENS_UNUSED_1>'],
image_placeholder='<Img><ImageHere></Img>',
**kwargs):
super().__init__(**kwargs)
self.meta_instruction = meta_instruction
self.user = user
self.assistant = assistant
self.eoh = eoh
self.eoa = eoa
self.stop_words = stop_words
self.image_placeholder = image_placeholder
def _concat_image_info(self, prompt):
"""Append image placeholder."""
if isinstance(prompt, str):
return prompt
prompt, nimg = prompt
assert nimg <= 1
if nimg == 1:
prompt = f'{self.image_placeholder}{prompt}'
return prompt
def get_prompt(self, prompt, sequence_start=True):
"""Apply chat template to prompt."""
prompt = self._concat_image_info(prompt)
return super().get_prompt(prompt, sequence_start)
def messages2prompt(self, messages, sequence_start=True):
"""Apply chat template to history."""
if isinstance(messages, str) or isinstance(messages[0], str):
return self.get_prompt(messages, sequence_start)
box_map = dict(user=self.user,
assistant=self.assistant,
system=self.system)
eox_map = dict(user=self.eoh,
assistant=self.eoa + self.separator,
system=self.eosys)
ret = ''
if self.meta_instruction is not None:
if len(messages) and messages[0]['role'] != 'system':
ret += f'{self.system}{self.meta_instruction}{self.eosys}'
for message in messages:
role = message['role']
content = message['content']
if role == 'user' and not isinstance(content, str):
assert isinstance(content, Sequence)
assert all(isinstance(item, dict) for item in content)
content = [content[0]['text'], len(content) - 1]
content = self._concat_image_info(content)
ret += f'{box_map[role]}{content}{eox_map[role]}'
ret += f'{self.assistant}'
return ret
class InternLMXComposer:
"""Internlm-xcomposer preprocessor to prepare the inputs for a model."""
def __init__(self, pretrained_model_name_or_path, **kwargs):
self.pretrained_model_name_or_path = pretrained_model_name_or_path
self.decorator = InternLMXComposerTemplate(**kwargs)
self._load_model()
def _load_model(self):
path = self.pretrained_model_name_or_path
if not os.path.exists(path):
path = snapshot_download(path)
self.tokenizer = AutoTokenizer.from_pretrained(path,
trust_remote_code=True)
with init_empty_weights():
config = AutoConfig.from_pretrained(path, trust_remote_code=True)
config.num_hidden_layers = 0 # speedup
model = AutoModelForCausalLM.from_config(config,
trust_remote_code=True)
model.internlm_model = None
model.to_empty(device='cpu')
named_parameters = set()
for key, _ in model.named_parameters():
named_parameters.add(key)
# TODO: load bin according to index.json
bins = glob(os.path.join(path, '*.bin'))
# bins = glob(os.path.join(path, '*.safetensors'))
for bin in bins:
dt = torch.load(bin, map_location='cpu')
# dt = load_file(bin)
missed, _ = model.load_state_dict(dt, strict=False)
named_parameters.difference_update(set(missed))
assert len(
named_parameters) == 0, f'missing keys: {named_parameters}'
self.model = model.to('cuda').eval()
@torch.no_grad()
def encode_img(self, paths):
"""Extract image features."""
if len(paths) == 0:
return None
features = []
with torch.cuda.amp.autocast(dtype=torch.float16):
for path in paths:
out = self.model.encode_img(path)
features.append(out.squeeze().cpu().numpy())
return features
def _to_inputs(self, decorate_text, image_paths, sequence_start):
features = self.encode_img(image_paths)
input_ids = []
ranges = None
begins = []
segs = decorate_text.split(self.decorator.image_placeholder)
image_dim = features[-1].shape[0] if features is not None else 0
for i, seg in enumerate(segs):
if i > 0:
begins.append(len(input_ids))
input_ids.extend([0] * image_dim)
seg_ids = self.tokenizer.encode(
seg, add_special_tokens=((i == 0) and sequence_start))
input_ids.extend(seg_ids)
if features is not None:
ends = np.array(begins) + image_dim
ranges = np.stack([begins, ends], axis=1).tolist()
return input_ids, features, ranges
def prepare_query(self, query, sequence_start=True):
"""Convert query to input_ids, features and the ranges of features to
input_ids."""
image_paths = []
if not isinstance(query, str):
query, image_paths = query[0], query[1:]
if len(image_paths) > 1:
            print('Multiple images are not supported; using the last one.')
image_paths = image_paths[-1:]
decorate_text = self.decorator.get_prompt((query, len(image_paths)))
return self._to_inputs(decorate_text, image_paths, sequence_start)
def prepare_message(self, messages):
"""Convert messages to input_ids, features and the ranges of features
to input_ids."""
decorate_text = self.decorator.messages2prompt(messages, True)
image_paths = []
for msg in messages:
if msg['role'] == 'user':
content = msg['content']
if isinstance(content, str):
continue
for item in content:
if item['type'] == 'image_url':
url = item['image_url']['url']
image_paths.append(url)
return self._to_inputs(decorate_text, image_paths, True)
# Copyright (c) OpenMMLab. All rights reserved.
from .cli import run
if __name__ == '__main__':
run()
# Copyright (c) OpenMMLab. All rights reserved.
import os
from typing import Literal, Optional, Union
from lmdeploy.serve.async_engine import AsyncEngine
from lmdeploy.serve.vl_async_engine import VLAsyncEngine
from lmdeploy.utils import get_hf_config_content
from .messages import PytorchEngineConfig, TurbomindEngineConfig
from .utils import get_logger
SUPPORTED_TASKS = {'llm': AsyncEngine, 'vlm': VLAsyncEngine}
logger = get_logger('lmdeploy')
def autoget_backend(model_path: str) -> Union[Literal['turbomind', 'pytorch']]:
"""Get backend type in auto backend mode.
Args:
model_path (str): the path of a model.
It could be one of the following options:
- i) A local directory path of a turbomind model which is
converted by `lmdeploy convert` command or download from
ii) and iii).
- ii) The model_id of a lmdeploy-quantized model hosted
inside a model repo on huggingface.co, such as
"InternLM/internlm-chat-20b-4bit",
"lmdeploy/llama2-chat-70b-4bit", etc.
- iii) The model_id of a model hosted inside a model repo
on huggingface.co, such as "internlm/internlm-chat-7b",
"Qwen/Qwen-7B-Chat ", "baichuan-inc/Baichuan2-7B-Chat"
and so on.
Returns:
str: the backend type.
"""
from lmdeploy.pytorch.supported_models import \
is_supported as is_supported_pytorch
pytorch_has, turbomind_has = False, False
try:
from lmdeploy.turbomind.supported_models import \
is_supported as is_supported_turbomind
turbomind_has = is_supported_turbomind(model_path)
except ImportError:
logger.warning(
'Lmdeploy with turbomind engine is not installed correctly. '
'You may need to install lmdeploy from pypi or build from source '
'for turbomind engine.')
pytorch_has = is_supported_pytorch(model_path)
if not (pytorch_has or turbomind_has):
logger.warning(f'{model_path} is not explicitly supported by lmdeploy.'
f' Try to run with lmdeploy pytorch engine.')
backend = 'turbomind' if turbomind_has else 'pytorch'
return backend
def autoget_backend_config(
model_path: str,
backend_config: Optional[Union[PytorchEngineConfig,
TurbomindEngineConfig]] = None
) -> Union[PytorchEngineConfig, TurbomindEngineConfig]:
"""Get backend config automatically.
Args:
model_path (str): The input model path.
backend_config (TurbomindEngineConfig | PytorchEngineConfig): The
input backend config. Default to None.
Returns:
(PytorchEngineConfig | TurbomindEngineConfig): The auto-determined
backend engine config.
"""
from dataclasses import asdict
backend = autoget_backend(model_path)
if backend == 'pytorch':
config = PytorchEngineConfig()
else:
config = TurbomindEngineConfig()
if backend_config is not None:
data = asdict(backend_config)
for k, v in data.items():
if v and hasattr(config, k):
setattr(config, k, v)
return config
def check_vl_llm(config: dict) -> bool:
"""check if the model is a vl model from model config."""
arch = config['architectures'][0]
if arch == 'LlavaLlamaForCausalLM':
return True
elif arch == 'QWenLMHeadModel' and 'visual' in config:
return True
return False
def get_task(model_path: str):
"""get pipeline type and pipeline class from model config."""
if os.path.exists(os.path.join(model_path, 'triton_models', 'weights')):
# workspace model
return 'llm', AsyncEngine
config = get_hf_config_content(model_path)
if check_vl_llm(config):
return 'vlm', VLAsyncEngine
# default task, pipeline_class
return 'llm', AsyncEngine
# Copyright (c) OpenMMLab. All rights reserved.
from .chat import SubCliChat
from .cli import CLI
from .lite import SubCliLite
from .serve import SubCliServe
def run():
"""The entry point of running LMDeploy CLI."""
CLI.add_parsers()
SubCliChat.add_parsers()
SubCliServe.add_parsers()
SubCliLite.add_parsers()
parser = CLI.parser
args = parser.parse_args()
if 'run' in dir(args):
args.run(args)
else:
try:
args.print_help()
except AttributeError:
command = args.command
if command == 'serve':
SubCliServe.parser.print_help()
elif command == 'lite':
SubCliLite.parser.print_help()
elif command == 'chat':
SubCliChat.parser.print_help()
else:
parser.print_help()
# Copyright (c) OpenMMLab. All rights reserved.
import argparse
from typing import List
class DefaultsAndTypesHelpFormatter(argparse.HelpFormatter):
"""Formatter to output default value and type in help information."""
def _get_help_string(self, action):
"""Add default and type info into help."""
help = action.help
if '%(default)' not in action.help:
if action.default is not argparse.SUPPRESS:
defaulting_nargs = [argparse.OPTIONAL, argparse.ZERO_OR_MORE]
if (action.option_strings or action.nargs
in defaulting_nargs) and 'default' not in help.lower():
help += '. Default: %(default)s'
if action.type:
help += '. Type: %(type)s'
return help
def convert_args(args):
"""Convert args to dict format."""
special_names = ['run', 'command']
kwargs = {
k[0]: k[1]
for k in args._get_kwargs() if k[0] not in special_names
}
return kwargs
def get_lora_adapters(adapters: List[str]):
"""Parse lora adapers from cli input.
Args:
adapters (List[str]): CLI input string of lora adapter path(s).
Returns:
Dict[str,str] or None: Parsed lora adapter path(s).
"""
if not adapters:
return None
n = len(adapters)
output = {}
if n == 1:
name = 'default'
path = adapters[0].strip()
if '=' in path:
name, path = path.split('=', 1)
output[name] = path
else:
for pair in adapters:
            assert '=' in pair, f'Multiple lora paths must be in the format of ' \
f'xxx=yyy. But given: {pair}'
name, path = pair.strip().split('=', 1)
assert name not in output, f'Multiple lora paths with ' \
f'repeated lora name: {name}'
output[name] = path
return output
class ArgumentHelper:
"""Helper class to add unified argument."""
@staticmethod
def model_name(parser):
"""Add argument model_name to parser."""
return parser.add_argument(
'--model-name',
type=str,
default=None,
help='The name of the to-be-deployed model, such as'
' llama-7b, llama-13b, vicuna-7b and etc. You '
'can run `lmdeploy list` to get the supported '
'model names')
@staticmethod
def model_format(parser, default: str = None):
return parser.add_argument(
'--model-format',
type=str,
default=default,
choices=['hf', 'llama', 'awq'],
help='The format of input model. `hf` meaning `hf_llama`, `llama` '
'meaning `meta_llama`, `awq` meaning the quantized model by awq')
@staticmethod
def tp(parser):
"""Add argument tp to parser."""
return parser.add_argument(
'--tp',
type=int,
default=1,
help='GPU number used in tensor parallelism. Should be 2^n')
@staticmethod
def session_id(parser):
"""Add argument session_id to parser."""
return parser.add_argument('--session-id',
type=int,
default=1,
help='The identical id of a session')
@staticmethod
def session_len(parser, default: int = None):
return parser.add_argument('--session-len',
type=int,
default=default,
help='The max session length of a sequence')
@staticmethod
def max_batch_size(parser):
"""Add argument max_batch_size to parser."""
return parser.add_argument('--max-batch-size',
type=int,
default=128,
help='Maximum batch size')
@staticmethod
def quant_policy(parser):
"""Add argument quant_policy to parser."""
return parser.add_argument('--quant-policy',
type=int,
default=0,
help='Whether to use kv int8')
@staticmethod
def rope_scaling_factor(parser):
"""Add argument rope_scaling_factor to parser."""
return parser.add_argument('--rope-scaling-factor',
type=float,
default=0.0,
help='Rope scaling factor')
@staticmethod
def use_logn_attn(parser):
"""Add argument use_logn_attn to parser."""
return parser.add_argument(
'--use-logn-attn',
action='store_true',
default=False,
help='Whether to use logn attention scaling')
@staticmethod
def block_size(parser):
"""Add argument block_size to parser."""
return parser.add_argument('--block-size',
type=int,
default=64,
help='The block size for paging cache')
@staticmethod
def top_p(parser):
"""Add argument top_p to parser."""
return parser.add_argument(
'--top-p',
type=float,
default=0.8,
help='An alternative to sampling with temperature,'
' called nucleus sampling, where the model '
'considers the results of the tokens with '
'top_p probability mass')
@staticmethod
def top_k(parser):
"""Add argument top_k to parser."""
return parser.add_argument(
'--top-k',
type=int,
default=1,
help='An alternative to sampling with temperature, '
'where the model considers the top_k tokens '
'with the highest probability')
@staticmethod
def temperature(parser, default: float = 0.8):
return parser.add_argument('-temp',
'--temperature',
type=float,
default=default,
help='Sampling temperature')
@staticmethod
def repetition_penalty(parser):
"""Add argument repetition_penalty to parser."""
return parser.add_argument('--repetition-penalty',
type=float,
default=1.0,
help='Parameter to penalize repetition')
@staticmethod
def cap(parser):
"""Add argument cap to parser."""
return parser.add_argument(
'--cap',
type=str,
default='chat',
choices=['completion', 'infilling', 'chat', 'python'],
help='The capability of a model. '
'Deprecated. Please use --chat-template instead')
@staticmethod
def log_level(parser):
"""Add argument log_level to parser."""
import logging
return parser.add_argument('--log-level',
type=str,
default='ERROR',
choices=list(logging._nameToLevel.keys()),
help='Set the log level')
@staticmethod
def api_keys(parser):
return parser.add_argument(
'--api-keys',
type=str,
nargs='*',
default=None,
help='Optional list of space separated API keys',
)
@staticmethod
def ssl(parser):
return parser.add_argument(
'--ssl',
action='store_true',
required=False,
default=False,
help='Enable SSL. Requires OS Environment variables'
" 'SSL_KEYFILE' and 'SSL_CERTFILE'",
)
@staticmethod
def backend(parser):
"""Add argument backend to parser."""
return parser.add_argument('--backend',
type=str,
default='turbomind',
choices=['pytorch', 'turbomind'],
help='Set the inference backend')
@staticmethod
def engine(parser):
"""Add argument engine to parser."""
return parser.add_argument('--engine',
type=str,
default='turbomind',
choices=['pytorch', 'turbomind'],
help='Set the inference backend')
@staticmethod
def stream_output(parser):
"""Add argument stream_output to parser."""
return parser.add_argument(
'--stream-output',
action='store_true',
help='Indicator for streaming output or not')
@staticmethod
def calib_dataset(parser):
"""Add argument calib_dataset to parser."""
return parser.add_argument('--calib-dataset',
type=str,
default='ptb',
help='The calibration dataset name')
@staticmethod
def calib_samples(parser):
"""Add argument calib_samples to parser."""
return parser.add_argument(
'--calib-samples',
type=int,
default=128,
help='The number of samples for calibration')
@staticmethod
def calib_seqlen(parser):
"""Add argument calib_seqlen to parser."""
return parser.add_argument('--calib-seqlen',
type=int,
default=2048,
help='The sequence length for calibration')
@staticmethod
def device(parser):
"""Add argument device to parser."""
return parser.add_argument('--device',
type=str,
default='cuda',
choices=['cuda', 'cpu'],
help='Device type of running')
@staticmethod
def meta_instruction(parser):
"""Add argument meta_instruction to parser."""
return parser.add_argument(
'--meta-instruction',
type=str,
default=None,
help='System prompt for ChatTemplateConfig. Deprecated. '
'Please use --chat-template instead')
@staticmethod
def chat_template(parser):
"""Add chat template config to parser."""
return parser.add_argument(
'--chat-template',
type=str,
default=None,
help=\
'A JSON file or string that specifies the chat template configuration. ' # noqa
'Please refer to https://lmdeploy.readthedocs.io/en/latest/advance/chat_template.html for the specification' # noqa
)
@staticmethod
def cache_max_entry_count(parser):
"""Add argument cache_max_entry_count to parser."""
return parser.add_argument(
'--cache-max-entry-count',
type=float,
default=0.8,
help='The percentage of gpu memory occupied by the k/v cache')
@staticmethod
def adapters(parser):
"""Add argument adapters to parser."""
return parser.add_argument(
'--adapters',
nargs='*',
type=str,
default=None,
help='Used to set path(s) of lora adapter(s). One can input '
'key-value pairs in xxx=yyy format for multiple lora '
'adapters. If only have one adapter, one can only input '
'the path of the adapter.')
@staticmethod
def work_dir(parser):
"""Add argument work_dir to parser."""
return parser.add_argument(
'--work-dir',
type=str,
default='./work_dir',
help='The working directory to save results')
# Copyright (c) OpenMMLab. All rights reserved.
# Copyright (c) OpenMMLab. All rights reserved.
"""Chat with torch models."""
# Copyright (c) OpenMMLab. All rights reserved.
import torch
class LoadNoInit:
"""Initialize model without parameter initialization."""
def __init__(self):
self.constant_ = torch.nn.init.constant_
self.zeros_ = torch.nn.init.zeros_
self.ones_ = torch.nn.init.ones_
self.uniform_ = torch.nn.init.uniform_
self.normal_ = torch.nn.init.normal_
self.kaiming_uniform_ = torch.nn.init.kaiming_uniform_
self.kaiming_normal_ = torch.nn.init.kaiming_normal_
def __enter__(self, *args, **kwargs):
"""Replace initializers with no-op."""
torch.nn.init.constant_ = lambda *args, **kwargs: None
torch.nn.init.zeros_ = lambda *args, **kwargs: None
torch.nn.init.ones_ = lambda *args, **kwargs: None
torch.nn.init.uniform_ = lambda *args, **kwargs: None
torch.nn.init.normal_ = lambda *args, **kwargs: None
torch.nn.init.kaiming_uniform_ = lambda *args, **kwargs: None
torch.nn.init.kaiming_normal_ = lambda *args, **kwargs: None
def __exit__(self, *args, **kwargs):
"""Recover."""
torch.nn.init.constant_ = self.constant_
torch.nn.init.zeros_ = self.zeros_
torch.nn.init.ones_ = self.ones_
torch.nn.init.uniform_ = self.uniform_
torch.nn.init.normal_ = self.normal_
torch.nn.init.kaiming_uniform_ = self.kaiming_uniform_
torch.nn.init.kaiming_normal_ = self.kaiming_normal_
# Copyright (c) OpenMMLab. All rights reserved.
import torch.nn as nn
from lmdeploy.utils import get_logger
from .base import BasicAdapter, BasicAdapterFast
from .internlm import InternLMAdapter
from .llama2 import Llama2Adapter
logger = get_logger(__name__)
def _get_default_adapter(tokenizer):
if tokenizer.is_fast:
return BasicAdapterFast
else:
return BasicAdapter
def init_adapter(model: nn.Module, tokenizer, adapter=None):
if adapter is None:
for v in model.modules():
if 'InternLMModel' in v.__class__.__name__:
Adapter = InternLMAdapter
break
elif 'LlamaModel' in v.__class__.__name__:
Adapter = Llama2Adapter
break
else:
Adapter = _get_default_adapter(tokenizer)
elif adapter == 'llama1':
Adapter = _get_default_adapter(tokenizer)
else:
raise ValueError(f'Adapter {adapter} is not allowed.')
logger.info(f'Using adapter {Adapter.__name__}')
return Adapter(tokenizer)
# Copyright (c) OpenMMLab. All rights reserved.
"""Basic adapter suitable for general HuggingFace models."""
import re
from transformers import (PreTrainedTokenizer, PreTrainedTokenizerBase,
PreTrainedTokenizerFast)
from lmdeploy.utils import get_logger
logger = get_logger(__name__)
class BaseAdapter:
"""Base class for all adapters.
Note:
Adapters coordinate with the session manager to prepare input_ids.
The full sequence fed to the model is as follows:
```
adapter.start_ids
adapter.encode_and_decorate(user_input_1)
output_1_generated_by_model
adapter.sep_ids
adapter.encode_and_decorate(user_input_2)
output_2_generated_by_model
adapter.sep_ids
adapter.encode_and_decorate(user_input_3)
```
Thus adapter is responsible for providing model specific
``start_ids``, ``sep_ids``, and method to encode single prompt.
"""
def __init__(self, tokenizer: PreTrainedTokenizerBase):
self.tokenizer = tokenizer
def encode_and_decorate(self, prompt, add_special_tokens=False):
"""Model specific method to encode and decorate prompt."""
raise NotImplementedError
def decode(self, value):
"""Model specific method to decode single value to string."""
raise NotImplementedError
@property
def stopping_criteria(self):
"""Model specific stopping criteria for generation."""
return None
@property
def start_ids(self):
"""Model specific start ids."""
return [self.tokenizer.bos_token_id]
@property
def sep_ids(self):
"""Model specific separation ids."""
return [self.tokenizer.bos_token_id]
class BasicAdapter(BaseAdapter):
"""Basic adapter for slow tokenizers."""
def encode_and_decorate(self, prompt, add_special_tokens=False):
"""Encode prompt.
Note:
we leave <bos> to session manager to add.
"""
input_ids = self.tokenizer.encode(
prompt,
add_special_tokens=add_special_tokens,
return_tensors='pt',
)
logger.debug(f'Encode {prompt} to {input_ids}')
return input_ids
def decode(self, value):
"""Fallback when tokenizer is not fast."""
self.tokenizer: PreTrainedTokenizer
tok = self.tokenizer.decode(value)
return tok + ' '
class BasicAdapterFast(BaseAdapter):
"""Basic adapter for slow tokenizers."""
hex_regex = re.compile(r'^<0x([0-9ABCDEF]+)>$')
def encode_and_decorate(self, prompt, add_special_tokens=False):
"""Encode prompt.
Note:
we leave <bos> to session manager to add.
"""
input_ids = self.tokenizer.encode(
prompt,
add_special_tokens=add_special_tokens,
return_tensors='pt',
)
logger.debug(f'Encode {prompt} to {input_ids}')
return input_ids
def decode(self, value):
"""Decode with fast tokenizers."""
self.tokenizer: PreTrainedTokenizerFast
tok = self.tokenizer._convert_id_to_token(value)
if tok.startswith('▁'): # sentencepiece
space = ' '
tok = tok[1:]
else:
space = ''
if res := self.hex_regex.match(tok):
tok = chr(int(res.group(1), 16))
if tok == '</s>' or tok == '\r':
tok = '\n'
tok = space + tok
logger.debug(f'Decode {value} to {repr(tok)}')
return tok
# Copyright (c) OpenMMLab. All rights reserved.
import re
import torch
from transformers import (PreTrainedTokenizerFast, StoppingCriteria,
StoppingCriteriaList)
from lmdeploy.utils import get_logger
from .base import BaseAdapter
logger = get_logger(__name__)
class InternLMStoppingCriteria(StoppingCriteria):
"""Stopping criteria for HF version of InternLM."""
def __call__(self, input_ids, *args, **kwargs) -> bool:
return input_ids[0, -1] in [2, 103028]
class InternLMAdapter(BaseAdapter):
"""Adapter for InternLM.
    InternLM uses the following template, where \n must be encoded as token id 13.
<bos> (no actual newline here, just for better readability)
<|User|>:{prompt}<eoh>\n
<|Bot|>:{model_output}<eoa>\n
<|User|>:{prompt}<eoh>\n
<|Bot|>:{model_output}<eoa>\n
...
<eos>
"""
hex_regex = re.compile(r'^<0x([0-9ABCDEF]+)>$')
# ids of '<|User|>:'
B_USER_ID = torch.tensor([[333, 352, 1621, 352, 27232]])
# ids of '<eoh>\n<|Bot|>:'
E_USER_ID = torch.tensor([[103027, 13, 333, 352, 23845, 352, 27232]])
# ids of '<bos>'
start_ids = [1]
# ids of '\n'
sep_ids = [13]
def __init__(self, tokenizer: PreTrainedTokenizerFast):
self.tokenizer = tokenizer
def encode_and_decorate(self, prompt):
r"""Encode prompt and decorate with template.
Note:
we leave <bos> and chat history for session manager to add,
so we will decorate input_ids to '<|User|>:{prompt}<eoh>\n<|Bot|>:'
"""
input_ids = self.tokenizer.encode(
prompt,
add_special_tokens=False,
return_tensors='pt',
)
# This is f'<|User|>:{prompt}<eoh>\n<|Bot|>:'
# but force \n to 13 instead of 364
input_ids = torch.cat([self.B_USER_ID, input_ids, self.E_USER_ID],
dim=1)
return input_ids
def decode(self, value):
"""Decode generated tokens for InternLM."""
tok = self.tokenizer.decode(value)
if res := self.hex_regex.match(tok):
tok = chr(int(res.group(1), 16))
if tok == '</s>' or tok == '<eoa>' or tok == '\r':
tok = '\n'
logger.debug(f'Decode {value} to {repr(tok)}')
return tok
@property
def stopping_criteria(self):
return StoppingCriteriaList([InternLMStoppingCriteria()])