"docs/developer_guide/setup_github_runner.md" did not exist on "33f0de337d978b37c63b98575b4962c6e6479e8c"
Commit fe851fbc authored by zhouxiang's avatar zhouxiang
Browse files

0.2.6版本新增文件补充

parent e2d98ddc
# Request Distribution Server
The request distribution service connects multiple api_server instances in parallel. Users only need to access the proxy URL to reach the different api_server services indirectly. The proxy dispatches requests internally and performs load balancing.
## Launch
Start the proxy service:
```shell
python3 -m lmdeploy.serve.proxy.proxy --server_name {server_name} --server_port {server_port} --strategy "min_expected_latency"
```
After a successful start, the script also prints the URL of the proxy service. Open this URL in a browser to access the Swagger UI.
## API
Through the Swagger UI we can see several APIs. Those related to api_server node management are:
- /nodes/status
- /nodes/add
- /nodes/remove
They list all api_server service nodes, add a node, and remove a node, respectively, as in the sketch below.
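The following is a minimal sketch of managing nodes through these endpoints with Python's `requests` library. The proxy address and the request field names (`url`, `node_url`) are assumptions for illustration; check the Swagger UI for the exact schemas.
```python
import requests

PROXY = 'http://0.0.0.0:8000'  # assumed proxy address

# register an api_server node behind the proxy (field name is an assumption)
requests.post(f'{PROXY}/nodes/add', json={'url': 'http://0.0.0.0:23333'})

# list all registered nodes
print(requests.get(f'{PROXY}/nodes/status').json())

# remove a node (parameter name is an assumption)
requests.post(f'{PROXY}/nodes/remove', params={'node_url': 'http://0.0.0.0:23333'})
```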
The APIs related to usage are:
- /v1/models
- /v1/chat/completions
- /v1/completions
These APIs are used in the same way as on an api_server; a usage sketch follows below.
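A minimal sketch using `requests`, assuming the proxy listens on http://0.0.0.0:8000, at least one api_server node is registered, and the endpoints follow the OpenAI-style schema used by api_server:
```python
import requests

PROXY = 'http://0.0.0.0:8000'  # assumed proxy address

# the proxy exposes the same OpenAI-style endpoints as api_server
model = requests.get(f'{PROXY}/v1/models').json()['data'][0]['id']
resp = requests.post(f'{PROXY}/v1/chat/completions',
                     json={
                         'model': model,
                         'messages': [{'role': 'user', 'content': 'Hello'}],
                         'temperature': 0.8,
                     })
print(resp.json())
```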
## Dispatch Strategies
The proxy service currently supports the following dispatch strategies:
- random: weighted random dispatch based on the request-handling capacity provided for each api_server node. The larger a node's throughput, the more likely it is to be picked. Nodes that did not report a throughput are treated as having the average throughput of the other nodes.
- min_expected_latency: based on each node's outstanding requests and its throughput, compute the expected time needed to finish the response and dispatch to the node with the shortest time (see the sketch below). Nodes without a reported throughput are handled as above.
- min_observed_latency: based on the average time each node took to finish a number of recent requests, dispatch to the node with the shortest time.
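The sketch below illustrates the idea behind min_expected_latency only; it is not the proxy's actual implementation.
```python
def pick_node(nodes):
    """nodes: list of dicts like {'url': str, 'pending': int, 'throughput': float | None}."""
    known = [n['throughput'] for n in nodes if n.get('throughput')]
    avg = sum(known) / len(known) if known else 1.0

    def expected_latency(node):
        # nodes without a reported throughput fall back to the average
        speed = node.get('throughput') or avg
        return node['pending'] / speed

    return min(nodes, key=expected_latency)


nodes = [
    {'url': 'http://0.0.0.0:23333', 'pending': 4, 'throughput': 2.0},
    {'url': 'http://0.0.0.0:23334', 'pending': 1, 'throughput': None},
]
print(pick_node(nodes)['url'])  # http://0.0.0.0:23334
```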
## LMDeploy-QoS Introduction and Usage
### Background
With the rise of LLMs and AGI, many inference frameworks have emerged to serve scalable, high-performance online workloads for language models. These workloads usually involve multiple user groups and change rapidly over short periods of time. Many inference frameworks struggle to meet the demands of such multi-tenant traffic patterns and do not properly regulate user behavior, so we believe multi-user load balancing deserves first-class consideration in an LLM inference framework.
### User Categorization for Multi-tenant Handling
LMDeploy-QoS works together with LMDeploy and provides a set of multi-tenant features. It requires users to tag their inference requests with a proper user identifier (user_id in the configuration or codebase). It uses a dictionary-based configuration as the multi-tenant policy: users are mapped to different "user groups" and assigned a usage quota. The multi-tenant policy reads this configuration and schedules user inference requests according to the priority of their user group and the gap between the predefined quota and the real-time allocation ratio. After thorough testing, our LMDeploy-QoS module greatly improves the serving reliability of LLMs and the GPU resource utilization of inference workloads.
LMDeploy divides users into 4 groups:
- Platinum
- Gold
- Silver
- Bronze
Based on our experience in serving LLMs, the following 4 types of users can be mapped to these groups:
- Platinum: VIP or administrator users, for example service developers or demo presenters who need uninterrupted access. Their workload frequency is low, and their resource demand on inference is small.
- Gold: premium users with signed service contracts who need measurable, reliable service. For example, company A signs a contract with the LLM service provider for X requests per second at Z% availability for its employees, paying Y million dollars per year.
- Silver: the vast majority of users. Most trial or monthly-subscription users fall into this category. They need relatively little service, but their user experience is still important for the reputation of the LLM service.
- Bronze: heavy users who pay very little to the LLM provider.
The categorization above is meant as guidance rather than a recommendation for all LMDeploy users, since it does not necessarily suit every LLM business. Administrators can collect statistics on daily user workloads and decide on their own how to categorize users.
Next, let us discuss how LMDeploy dispatches requests based on these categories.
### Multi-tenant Strategies
#### Strategy 1: prioritized scheduling between user groups
We introduce the concept of a "user group". The module user defines the mapping from users to user groups (which can be understood as a uid-to-group mapping). The recommended 4 user groups are:
- Platinum
- Gold
- Silver
- Bronze
The priority order between the four groups is strictly Platinum > Gold > Silver > Bronze. When the system is busy, requests from higher-ranked groups are executed first.
The diagram below shows how prioritized scheduling works. You can see that the Platinum request has been re-prioritized and moved to the head of the queue.
![](https://github.com/InternLM/lmdeploy/assets/52888924/9d63f081-7168-4c74-8456-24f0a4b41649)
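An illustrative sketch of this idea (not LMDeploy's actual scheduler): a priority queue where a smaller group rank always wins and arrival order breaks ties.
```python
import heapq
from itertools import count

PRIORITY = {'Platinum': 0, 'Gold': 1, 'Silver': 2, 'Bronze': 3}
arrival = count()
queue = []


def submit(group, request):
    heapq.heappush(queue, (PRIORITY[group], next(arrival), request))


submit('Silver', 'req-1')
submit('Bronze', 'req-2')
submit('Platinum', 'req-3')  # jumps ahead of the earlier requests

while queue:
    _, _, request = heapq.heappop(queue)
    print(request)  # req-3, req-1, req-2
```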
#### Strategy 2: proportional sharing and soft isolation within a user group
This strategy applies only within a user group. We introduce a per-group user quota table that defines each user's "ideal share" of 100% of the GPU resources. Each "user" appears in the list as a user_id, and a user can belong to only one group. Users below their configured quota get higher priority for released resources than users above it, until the usage of both sides converges toward the original quota ratio. The scheduler only considers users present in the request queue and ignores configured users that are not in the queue.
The diagram below shows a typical example of this strategy.
![](https://github.com/InternLM/lmdeploy/assets/52888924/3e1d7135-6b11-4998-89a1-b72af6c962c3)
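An illustrative sketch of the intra-group rule (not LMDeploy's implementation): among the users currently waiting, serve the one whose measured share is furthest below its configured quota.
```python
def pick_user(queued_users, usage_pct, quota_pct):
    """queued_users: ids currently in the queue; usage_pct/quota_pct: percent per user."""
    def deficit(uid):
        # negative means the user is below its quota; the most under-served wins
        return usage_pct.get(uid, 0.0) - quota_pct.get(uid, 0.0)

    return min(queued_users, key=deficit)


quota = {'user_id1': 50, 'user_id2': 50}
usage = {'user_id1': 70.0, 'user_id2': 20.0}
print(pick_user(['user_id1', 'user_id2'], usage, quota))  # user_id2
```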
#### Strategy 3: a hybrid mechanism
This means enabling both the inter-group priority and the intra-group sharing/isolation in one system. The execution order is: apply the inter-group priority first, then apply sharing/isolation within each group; the timing diagram is omitted here. Note that the inter-group priority can completely override the intra-group decisions. For example, while two users inside a low-priority group are being scheduled against each other, an incoming high-priority request overrides all of the low-priority allocation logic and is executed first.
![](https://github.com/InternLM/lmdeploy/assets/52888924/e335f976-ff15-48db-b1ff-abf1c3327d6e)
Note that there may be other ways to build a hybrid mechanism; this document only describes one approach that works in our scenario. Any other hybrid approach has to deal with the fact that priority and proportional sharing are inherently conflicting policies, so there is no simple way to combine them within a single dimension.
### QoS Configuration Template
The configuration file is specified by the launch argument `--qos-config-path` and is loaded by the program at startup.
The configuration is placed together with the lmdeploy launch scripts and other files. It contains:
1. The QoS enabling switch: the subsequent QoS and user-related settings only take effect when it is set to true; when it is set to false, the rest of the configuration is ignored.
2. user_groups: a list that defines the priorities between groups.
3. user_group_map: a mapping that contains the priority of each user group, the user ids inside each group, and the quota assigned to each user within a group.
The configuration template is as follows; a small validation sketch is given after it:
```json
{
    "enable_user_qos": true,
    "user_groups": [
        "Platinum",
        "Gold",
        "Silver",
        "Bronze"
    ],
    "user_group_map": {
        "Platinum": [
            {
                "id": "user_id0",
                "quota_pct": 100
            },
            {
                "id": "default",
                "quota_pct": 0
            }
        ],
        "Gold": [
            {
                "id": "user_id1",
                "quota_pct": 50
            },
            {
                "id": "user_id2",
                "quota_pct": 50
            }
        ],
        "Silver": [
            {
                "id": "user_id3",
                "quota_pct": 5
            },
            {
                "id": "default",
                "quota_pct": 95
            }
        ],
        "Bronze": [
            {
                "id": "user_id4",
                "quota_pct": 30
            },
            {
                "id": "user_id5",
                "quota_pct": 30
            },
            {
                "id": "user_id6",
                "quota_pct": 40
            },
            {
                "id": "default",
                "quota_pct": 0
            }
        ]
    }
}
```
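A small sanity check for a hand-edited configuration, assuming the JSON layout shown above (in the template every group's quota_pct sums to 100; whether this is strictly required should be verified against the qos_engine code):
```python
import json

with open('qos_config.json') as f:  # example path
    cfg = json.load(f)

assert isinstance(cfg['enable_user_qos'], bool)
for group in cfg['user_groups']:
    entries = cfg['user_group_map'][group]
    total = sum(entry['quota_pct'] for entry in entries)
    print(f'{group}: {len(entries)} users, quota sum {total}')
```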
### How to Perform Inference with LMDeploy-QoS Awareness
The following examples show how to send inference requests that are aware of the multi-tenant policy. The QoS-related parameters are carried in the HTTP body as shown below:
/v1/chat/interactive_qos
```bash
curl -X POST http://localhost/v1/chat/interactive_qos \
-H "Content-Type: application/json" \
-d '{
"prompt": "Hello,Hello",
"session_id": -1,
"interactive_mode": false,
"stream": false,
"stop": false,
"request_output_len": 512,
"top_p": 0.8,
"top_k": 40,
"temperature": 0.8,
"repetition_penalty": 1,
"ignore_eos": false,
"user_id": "user_id0"
}'
```
/v1/chat/completions_qos
```bash
curl -X POST http://localhost/v1/chat/completions_qos \
-H "Content-Type: application/json" \
-d '{
"model": "internlm-chat-7b",
"messages": "Hello,Hello",
"temperature": 0.7,
"top_p": 1,
"n": 1,
"max_tokens": 512,
"stop": false,
"stream": false,
"presence_penalty": 0,
"frequency_penalty": 0,
"repetition_penalty": 1,
"session_id": -1,
"ignore_eos": false,
"user_id": "user_id0"
}'
```
/v1/completions_qos
```bash
curl -X POST http://localhost/v1/completions_qos \
-H "Content-Type: application/json" \
-d '{
"model": "internlm-chat-7b",
"prompt": "Hello,Hello",
"suffix": "string",
"temperature": 0.7,
"n": 1,
"max_tokens": 16,
"stop": "string",
"stream": false,
"top_p": 1,
"repetition_penalty": 1,
"session_id": -1,
"ignore_eos": false,
"user_id": "user_id0"
}'
```
### Modifying the Configuration File
The configuration template is located at `lmdeploy/serve/qos_engine/qos_config.json.template`. Add the users you need and set the correct priorities and quota values according to your actual requirements, as sketched below.
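For example, a minimal sketch of customizing the template programmatically (the added user id is hypothetical; the output path is then passed to `--qos-config-path`):
```python
import json

with open('lmdeploy/serve/qos_engine/qos_config.json.template') as f:
    cfg = json.load(f)

# add a hypothetical user to the Gold group and rebalance its quotas
cfg['user_group_map']['Gold'] = [
    {'id': 'user_id1', 'quota_pct': 40},
    {'id': 'user_id2', 'quota_pct': 40},
    {'id': 'new_user', 'quota_pct': 20},
]

with open('qos_config.json', 'w') as f:
    json.dump(cfg, f, indent=4)
```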
### Passing the Configuration
When starting the api_server, pass the configuration file path via `--qos-config-path`, for example:
```bash
CUDA_VISIBLE_DEVICES=0 lmdeploy serve api_server internlm/internlm-chat-7b --server-port 8000 --qos-config-path lmdeploy/serve/qos_engine/qos_config.json.template
```
### Contributors
[Eric](https://github.com/rhinouser0), [sallyjunjun](https://github.com/sallyjunjun), [sfireworks](https://github.com/sfireworks), [Dofgal](https://github.com/Dofgal), [shadow](https://github.com/awslshadowstar)
# Supported Models
## Models supported by TurboMind
| Model | Model size | FP16/BF16 | KV INT8 | W4A16 |
| :----------------: | :------: | :-------: | :-----: | :---: |
| Llama | 7B - 65B | Yes | Yes | Yes |
| Llama2 | 7B - 70B | Yes | Yes | Yes |
| InternLM | 7B - 20B | Yes | Yes | Yes |
| InternLM2 | 7B - 20B | Yes | - | Yes |
| InternLM-XComposer | 7B | Yes | Yes | Yes |
| QWen | 7B - 72B | Yes | Yes | Yes |
| QWen-VL | 7B | Yes | Yes | Yes |
| Baichuan | 7B | Yes | Yes | Yes |
| Baichuan2 | 7B | Yes | Yes | Yes |
| Code Llama | 7B - 34B | Yes | No | No |
| YI | 6B - 34B | Yes | No | No |
## Models supported by PyTorch
| Model | Model size | FP16/BF16 | KV INT8 | W8A8 |
| :----------: | :-------: | :-------: | :-----: | :--: |
| Llama | 7B - 65B | Yes | No | Yes |
| Llama2 | 7B - 70B | Yes | No | Yes |
| InternLM | 7B - 20B | Yes | No | Yes |
| InternLM2 | 7B - 20B | Yes | No | - |
| Baichuan2 | 7B - 13B | Yes | No | Yes |
| ChatGLM2 | 6B | Yes | No | No |
| Falcon | 7B - 180B | Yes | No | No |
| YI | 6B - 34B | Yes | No | No |
| Mistral | 7B | Yes | No | No |
| Mixtral | 8x7B | Yes | No | No |
| QWen1.5 | 7B - 72B | Yes | No | No |
| DeepSeek-MoE | 16B | Yes | No | No |
| Gemma | 2B - 7B | Yes | No | No |
# How to generate start_ids.csv
```bash
# update `model_file` path and `encode_line` content according to the actual situation
python3 tokenizer.py --model_file /workdir/llama2_13b_chat/tokenizer.model --encode_line 'LMDeploy is a toolkit for compressing, deploying, and serving LLMs.'
# refer to tokenizer.py for more usage scenarios
```
1,365,5773,1022,2376,338,263,5780,7354,363,27122,292,29892,7246,292,29892,322,16330,365,26369,29879,29889
# Vision-Language Web Demo
A chatbot demo with image input.
## Supported Models
- [InternLM/InternLM-XComposer](https://github.com/InternLM/InternLM-XComposer/tree/main)
- [Qwen/Qwen-VL-Chat](https://huggingface.co/Qwen/Qwen-VL-Chat)
## Quick Start
### internlm/internlm-xcomposer-7b
- extract the llm model from the huggingface model
```shell
python extract_xcomposer_llm.py
# the llm part will be saved to the internlm_model folder
```
- launch the demo
```shell
python app.py --model-name internlm-xcomposer-7b --llm-ckpt internlm_model
```
### Qwen-VL-Chat
- launch the demo
```shell
python app.py --model-name qwen-vl-chat --hf-ckpt Qwen/Qwen-VL-Chat
```
## Limitations
- This demo uses the code from the models' own repos to extract image features, which might not be very efficient.
- This demo only covers the chat function. If you want to use the localization ability of Qwen-VL-Chat or the article generation function of InternLM-XComposer, you need to implement the corresponding pre/post processing yourself. The difference compared to chat lies in how the prompts are built and how the model output is used.
import argparse
import os
import random
from contextlib import contextmanager
from dataclasses import dataclass, field
from itertools import count
from pathlib import Path
from threading import Lock
from typing import List, Tuple
import gradio as gr
from packaging.version import Version, parse
from qwen_model import QwenVLChat
from xcomposer_model import InternLMXComposer
from lmdeploy.serve.gradio.constants import CSS, THEME, disable_btn, enable_btn
from lmdeploy.turbomind import TurboMind
from lmdeploy.turbomind.chat import valid_str
BATCH_SIZE = 32
DEFAULT_MODEL_NAME = 'internlm-xcomposer-7b'
DEFAULT_HF_CKPT = 'internlm/internlm-xcomposer-7b'
# should use extract_xcomposer_llm.py to extract llm
# when use internlm-xcomposer-7b
DEFAULT_LLM_CKPT = None
SUPPORTED_MODELS = {
'internlm-xcomposer-7b': InternLMXComposer,
'qwen-vl-chat': QwenVLChat
}
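# gradio renamed the queue concurrency argument in 4.x; pick the kwarg that
# matches the installed version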
if parse(gr.__version__) >= Version('4.0.0'):
que_kwargs = {'default_concurrency_limit': BATCH_SIZE}
else:
que_kwargs = {'concurrency_count': BATCH_SIZE}
@dataclass
class Session:
_lock = Lock()
_count = count()
_session_id: int = None
_message: List[Tuple[str, str]] = field(default_factory=list)
_step: int = 0
def __init__(self):
with Session._lock:
self._session_id = next(Session._count)
self._message = []
self._step = 0
@property
def session_id(self):
return self._session_id
@property
def message(self):
return self._message
@property
def step(self):
return self._step
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument('--model-name',
type=str,
default=DEFAULT_MODEL_NAME,
help='Model name, default to %(default)s')
parser.add_argument(
'--hf-ckpt',
type=str,
default=DEFAULT_HF_CKPT,
help='hf checkpoint name or path, default to %(default)s')
parser.add_argument(
'--llm-ckpt',
type=str,
default=DEFAULT_LLM_CKPT,
help='LLM checkpoint name or path, default to %(default)s')
parser.add_argument('--server-port',
type=int,
default=9006,
help='Server port, default %(default)s')
parser.add_argument('--server-name',
type=str,
default='0.0.0.0',
help='Server name, default %(default)s')
args = parser.parse_args()
return args
@contextmanager
def get_stop_words():
from lmdeploy.tokenizer import Tokenizer
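    # temporarily monkey-patch Tokenizer.indexes_containing_token so that stop
    # words are matched by their exact encoded ids; the original method is
    # restored after the `yield`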
old_func = Tokenizer.indexes_containing_token
def new_func(self, token):
indexes = self.encode(token, add_bos=False)
return indexes
Tokenizer.indexes_containing_token = new_func
yield
Tokenizer.indexes_containing_token = old_func
def load_preprocessor_model(args):
"""Load preprocessor and llm inference engine."""
assert args.model_name in SUPPORTED_MODELS
llm_ckpt = args.hf_ckpt if args.llm_ckpt is None else args.llm_ckpt
preprocessor = SUPPORTED_MODELS[args.model_name](args.hf_ckpt)
with get_stop_words():
model = TurboMind.from_pretrained(llm_ckpt, model_name=args.model_name)
return preprocessor, model
def launch_demo(args, preprocessor, model):
def add_image(chatbot, session, file):
"""Append image to query."""
chatbot = chatbot + [((file.name, ), None)]
history = session._message
# [([user, url, url], assistant), ...]
if len(history) == 0 or history[-1][-1] is not None:
history.append([[file.name], None])
else:
history[-1][0].append(file.name)
return chatbot, session
def add_text(chatbot, session, text):
"""User query."""
chatbot = chatbot + [(text, None)]
history = session._message
if len(history) == 0 or history[-1][-1] is not None:
history.append([text, None])
else:
history[-1][0].insert(0, text)
return chatbot, session, disable_btn, enable_btn
def chat(
chatbot,
session,
request_output_len=512,
):
"""Chat with AI assistant."""
generator = model.create_instance()
history = session._message
sequence_start = len(history) == 1
seed = random.getrandbits(64) if sequence_start else None
input_ids, features, ranges = preprocessor.prepare_query(
history[-1][0], sequence_start)
if len(input_ids
) + session.step + request_output_len > model.model.session_len:
gr.Warning('WARNING: exceed session max length.'
' Please restart the session by reset button.')
yield chatbot, session, enable_btn, disable_btn, enable_btn
else:
response_size = 0
step = session.step
for outputs in generator.stream_infer(
session_id=session.session_id,
input_ids=input_ids,
input_embeddings=features,
input_embedding_ranges=ranges,
request_output_len=request_output_len,
stream_output=True,
sequence_start=sequence_start,
random_seed=seed,
step=step):
res, tokens = outputs[0]
# decode res
response = model.tokenizer.decode(res.tolist(),
offset=response_size)
if response.endswith('�'):
continue
response = valid_str(response)
response_size = tokens
if chatbot[-1][1] is None:
chatbot[-1][1] = ''
history[-1][1] = ''
chatbot[-1][1] += response
history[-1][1] += response
session._step = step + len(input_ids) + tokens
yield chatbot, session, disable_btn, enable_btn, disable_btn
yield chatbot, session, enable_btn, disable_btn, enable_btn
def stop(session):
"""Stop the session."""
generator = model.create_instance()
for _ in generator.stream_infer(session_id=session.session_id,
input_ids=[0],
request_output_len=0,
sequence_start=False,
sequence_end=False,
stop=True):
pass
def cancel(chatbot, session):
"""Stop the session and keey chat history."""
stop(session)
return chatbot, session, disable_btn, enable_btn, enable_btn
def reset(session):
"""Reset a new session."""
stop(session)
session._step = 0
session._message = []
return [], session, enable_btn
with gr.Blocks(css=CSS, theme=THEME) as demo:
with gr.Column(elem_id='container'):
gr.Markdown('## LMDeploy VL Playground')
chatbot = gr.Chatbot(elem_id='chatbot', label=model.model_name)
query = gr.Textbox(placeholder='Please input the instruction',
label='Instruction')
session = gr.State()
with gr.Row():
addimg_btn = gr.UploadButton('Upload Image',
file_types=['image'])
cancel_btn = gr.Button(value='Cancel', interactive=False)
reset_btn = gr.Button(value='Reset')
addimg_btn.upload(add_image, [chatbot, session, addimg_btn],
[chatbot, session],
show_progress=True,
queue=True)
send_event = query.submit(
add_text, [chatbot, session, query], [chatbot, session]).then(
chat, [chatbot, session],
[chatbot, session, query, cancel_btn, reset_btn])
query.submit(lambda: gr.update(value=''), None, [query])
cancel_btn.click(cancel, [chatbot, session],
[chatbot, session, cancel_btn, reset_btn, query],
cancels=[send_event])
reset_btn.click(reset, [session], [chatbot, session, query],
cancels=[send_event])
demo.load(lambda: Session(), inputs=None, outputs=[session])
demo.queue(api_open=True, **que_kwargs, max_size=100)
demo.launch(
share=True,
server_port=args.server_port,
server_name=args.server_name,
)
def main():
args = parse_args()
cur_folder = Path(__file__).parent.as_posix()
if cur_folder != os.getcwd():
os.chdir(cur_folder)
print(f'change working dir to {cur_folder}')
preprocessor, model = load_preprocessor_model(args)
launch_demo(args, preprocessor, model)
if __name__ == '__main__':
main()
import os
from pathlib import Path
import torch
from transformers import AutoModel, AutoTokenizer
from xcomposer_model import InternLMXComposerTemplate # noqa
model = AutoModel.from_pretrained('internlm/internlm-xcomposer-7b',
trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer-7b',
trust_remote_code=True)
internlm_model = model.internlm_model
lora_layers = [
'self_attn.q_proj', 'self_attn.v_proj', 'mlp.down_proj', 'mlp.up_proj'
]
def get_attr(m, key):
keys = key.split('.')
for key in keys:
m = getattr(m, key)
return m
# merge lora
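# fold each low-rank update back into the base weight (W += B @ A) so the merged
# model can be saved and served without the LoRA modules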
for i in range(len(internlm_model.model.layers)):
layer = internlm_model.model.layers[i]
for key in lora_layers:
lora_linear = get_attr(layer, key)
lora_b = lora_linear.lora_B
lora_a = lora_linear.lora_A
w_ba = torch.matmul(lora_b.weight, lora_a.weight)
lora_linear.weight.data += w_ba.data
# save model
cur_folder = Path(__file__).parent
dst_path = os.path.join(cur_folder, 'internlm_model')
internlm_model.save_pretrained(dst_path)
tokenizer.save_pretrained(dst_path)
import os
from glob import glob
import numpy as np
import torch
from accelerate import init_empty_weights
from huggingface_hub import snapshot_download
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from lmdeploy.model import MODELS, Qwen7BChat
@MODELS.register_module(name='qwen-vl-chat')
class QwenVLChatTemplate(Qwen7BChat):
"""Qwen vl chat template."""
def __init__(self,
session_len=8192,
top_p=0.3,
top_k=None,
temperature=1.0,
**kwargs):
super().__init__(**kwargs)
self.session_len = session_len
self.top_p = top_p
self.top_k = top_k
self.temperature = temperature
def _concat_image_info(self, prompt):
"""Append image placeholder."""
if isinstance(prompt, str):
return prompt
prompt, nimg = prompt
res = ''
for i in range(nimg):
res += f'Picture {str(i)}:<img>placeholder</img>\n'
prompt = res + prompt
return prompt
def get_prompt(self, prompt, sequence_start=True):
"""Apply chat template to prompt."""
prompt = self._concat_image_info(prompt)
return super().get_prompt(prompt, sequence_start)
def messages2prompt(self, messages, sequence_start=True):
"""Apply chat template to history."""
if isinstance(messages, str) or isinstance(messages[0], str):
return self.get_prompt(messages, sequence_start)
box_map = dict(user=self.user,
assistant=self.assistant,
system=self.system)
eox_map = dict(user=self.eoh,
assistant=self.eoa + self.separator,
system=self.eosys)
ret = ''
if self.meta_instruction is not None:
if len(messages) and messages[0]['role'] != 'system':
ret += f'{self.system}{self.meta_instruction}{self.eosys}'
for message in messages:
role = message['role']
content = message['content']
if role == 'user' and not isinstance(content, str):
content = [content[0]['text'], len(content) - 1]
content = self._concat_image_info(content)
ret += f'{box_map[role]}{content}{eox_map[role]}'
ret += f'{self.assistant}'
return ret
class QwenVLChat:
"""Qwen vl preprocessor to prepare the inputs for a model."""
def __init__(self, pretrained_model_name_or_path, **kwargs):
self.pretrained_model_name_or_path = pretrained_model_name_or_path
self.decorator = QwenVLChatTemplate(**kwargs)
self._load_model()
def _load_model(self):
path = self.pretrained_model_name_or_path
if not os.path.exists(path):
path = snapshot_download(path)
self.tokenizer = AutoTokenizer.from_pretrained(path,
trust_remote_code=True)
with init_empty_weights():
config = AutoConfig.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_config(config,
trust_remote_code=True)
del model.lm_head
for key in ['wte', 'h', 'ln_f']:
setattr(model.transformer, key, None)
model.to_empty(device='cpu')
named_parameters = set()
for key, _ in model.named_parameters():
named_parameters.add(key)
# TODO: load bin according to index.json
bins = glob(os.path.join(path, '*.bin'))
for bin in bins:
dt = torch.load(bin, map_location='cpu')
missed, _ = model.load_state_dict(dt, strict=False)
named_parameters.difference_update(set(missed))
assert len(
named_parameters) == 0, f'missing keys: {named_parameters}'
self.model = model.to('cuda').eval()
@torch.no_grad()
def encode_img(self, paths):
"""Extract image features."""
if len(paths) == 0:
return None
features = []
# with torch.cuda.amp.autocast(dtype=torch.float16):
features = self.model.transformer.visual.encode(paths).float()
features = [x.cpu().numpy() for x in features]
return features
def _to_inputs(self, decorate_text, image_paths, sequence_start):
features = self.encode_img(image_paths)
input_ids = self.tokenizer.encode(decorate_text)
ranges = None
if features is not None:
input_ids_arr = np.array(input_ids)
begins = np.where(
input_ids_arr == self.tokenizer.img_start_id)[0] + 1
ends = np.where(input_ids_arr == self.tokenizer.img_end_id)[0]
ranges = np.stack([begins, ends], axis=1)
assert len(features) == len(ranges)
return input_ids, features, ranges
def prepare_query(self, query, sequence_start=True):
"""Convert query to input_ids, features and the ranges of features to
input_ids."""
image_paths = []
if not isinstance(query, str):
query, image_paths = query[0], query[1:]
decorate_text = self.decorator.get_prompt((query, len(image_paths)),
sequence_start)
return self._to_inputs(decorate_text, image_paths, sequence_start)
def prepare_message(self, messages):
"""Convert messages to input_ids, features and the ranges of features
to input_ids."""
decorate_text = self.decorator.messages2prompt(messages, True)
image_paths = []
for msg in messages:
if msg['role'] == 'user':
content = msg['content']
if isinstance(content, str):
continue
for item in content:
if item['type'] == 'image_url':
url = item['image_url']['url']
image_paths.append(url)
return self._to_inputs(decorate_text, image_paths, True)
import os
# from safetensors.torch import load_file
from collections.abc import Sequence
from glob import glob
import numpy as np
import torch
from accelerate import init_empty_weights
from huggingface_hub import snapshot_download
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from lmdeploy.model import MODELS, BaseChatTemplate
meta_instruction = """meta instruction
You are an AI assistant whose name is 浦语.
- 浦语 is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.
- 浦语 can understand and communicate fluently in the language chosen by the user such as English and 中文.
conversation
""" # noqa
@MODELS.register_module(name='internlm-xcomposer-7b')
class InternLMXComposerTemplate(BaseChatTemplate):
"""Internlm xcomposer chat template."""
def __init__(self,
meta_instruction=meta_instruction,
user=' <|User|>: ',
assistant=' <|Bot|>: ',
eoh='<TOKENS_UNUSED_0>',
eoa='<TOKENS_UNUSED_1>',
stop_words=['<TOKENS_UNUSED_0>', '<TOKENS_UNUSED_1>'],
image_placeholder='<Img><ImageHere></Img>',
**kwargs):
super().__init__(**kwargs)
self.meta_instruction = meta_instruction
self.user = user
self.assistant = assistant
self.eoh = eoh
self.eoa = eoa
self.stop_words = stop_words
self.image_placeholder = image_placeholder
def _concat_image_info(self, prompt):
"""Append image placeholder."""
if isinstance(prompt, str):
return prompt
prompt, nimg = prompt
assert nimg <= 1
if nimg == 1:
prompt = f'{self.image_placeholder}{prompt}'
return prompt
def get_prompt(self, prompt, sequence_start=True):
"""Apply chat template to prompt."""
prompt = self._concat_image_info(prompt)
return super().get_prompt(prompt, sequence_start)
def messages2prompt(self, messages, sequence_start=True):
"""Apply chat template to history."""
if isinstance(messages, str) or isinstance(messages[0], str):
return self.get_prompt(messages, sequence_start)
box_map = dict(user=self.user,
assistant=self.assistant,
system=self.system)
eox_map = dict(user=self.eoh,
assistant=self.eoa + self.separator,
system=self.eosys)
ret = ''
if self.meta_instruction is not None:
if len(messages) and messages[0]['role'] != 'system':
ret += f'{self.system}{self.meta_instruction}{self.eosys}'
for message in messages:
role = message['role']
content = message['content']
if role == 'user' and not isinstance(content, str):
assert isinstance(content, Sequence)
assert all(isinstance(item, dict) for item in content)
content = [content[0]['text'], len(content) - 1]
content = self._concat_image_info(content)
ret += f'{box_map[role]}{content}{eox_map[role]}'
ret += f'{self.assistant}'
return ret
class InternLMXComposer:
"""Internlm-xcomposer preprocessor to prepare the inputs for a model."""
def __init__(self, pretrained_model_name_or_path, **kwargs):
self.pretrained_model_name_or_path = pretrained_model_name_or_path
self.decorator = InternLMXComposerTemplate(**kwargs)
self._load_model()
def _load_model(self):
path = self.pretrained_model_name_or_path
if not os.path.exists(path):
path = snapshot_download(path)
self.tokenizer = AutoTokenizer.from_pretrained(path,
trust_remote_code=True)
with init_empty_weights():
config = AutoConfig.from_pretrained(path, trust_remote_code=True)
config.num_hidden_layers = 0 # speedup
model = AutoModelForCausalLM.from_config(config,
trust_remote_code=True)
model.internlm_model = None
model.to_empty(device='cpu')
named_parameters = set()
for key, _ in model.named_parameters():
named_parameters.add(key)
# TODO: load bin according to index.json
bins = glob(os.path.join(path, '*.bin'))
# bins = glob(os.path.join(path, '*.safetensors'))
for bin in bins:
dt = torch.load(bin, map_location='cpu')
# dt = load_file(bin)
missed, _ = model.load_state_dict(dt, strict=False)
named_parameters.difference_update(set(missed))
assert len(
named_parameters) == 0, f'missing keys: {named_parameters}'
self.model = model.to('cuda').eval()
@torch.no_grad()
def encode_img(self, paths):
"""Extract image features."""
if len(paths) == 0:
return None
features = []
with torch.cuda.amp.autocast(dtype=torch.float16):
for path in paths:
out = self.model.encode_img(path)
features.append(out.squeeze().cpu().numpy())
return features
def _to_inputs(self, decorate_text, image_paths, sequence_start):
features = self.encode_img(image_paths)
input_ids = []
ranges = None
begins = []
segs = decorate_text.split(self.decorator.image_placeholder)
image_dim = features[-1].shape[0] if features is not None else 0
for i, seg in enumerate(segs):
if i > 0:
begins.append(len(input_ids))
input_ids.extend([0] * image_dim)
seg_ids = self.tokenizer.encode(
seg, add_special_tokens=((i == 0) and sequence_start))
input_ids.extend(seg_ids)
if features is not None:
ends = np.array(begins) + image_dim
ranges = np.stack([begins, ends], axis=1).tolist()
return input_ids, features, ranges
def prepare_query(self, query, sequence_start=True):
"""Convert query to input_ids, features and the ranges of features to
input_ids."""
image_paths = []
if not isinstance(query, str):
query, image_paths = query[0], query[1:]
if len(image_paths) > 1:
            print('Multiple images are not supported; using the last one.')
image_paths = image_paths[-1:]
decorate_text = self.decorator.get_prompt((query, len(image_paths)))
return self._to_inputs(decorate_text, image_paths, sequence_start)
def prepare_message(self, messages):
"""Convert messages to input_ids, features and the ranges of features
to input_ids."""
decorate_text = self.decorator.messages2prompt(messages, True)
image_paths = []
for msg in messages:
if msg['role'] == 'user':
content = msg['content']
if isinstance(content, str):
continue
for item in content:
if item['type'] == 'image_url':
url = item['image_url']['url']
image_paths.append(url)
return self._to_inputs(decorate_text, image_paths, True)
# Copyright (c) OpenMMLab. All rights reserved.
from .cli import run
if __name__ == '__main__':
run()
# Copyright (c) OpenMMLab. All rights reserved.
import os
from typing import Literal, Optional, Union
from lmdeploy.serve.async_engine import AsyncEngine
from lmdeploy.serve.vl_async_engine import VLAsyncEngine
from lmdeploy.utils import get_hf_config_content
from .messages import PytorchEngineConfig, TurbomindEngineConfig
from .utils import get_logger
SUPPORTED_TASKS = {'llm': AsyncEngine, 'vlm': VLAsyncEngine}
logger = get_logger('lmdeploy')
def autoget_backend(model_path: str) -> Union[Literal['turbomind', 'pytorch']]:
"""Get backend type in auto backend mode.
Args:
model_path (str): the path of a model.
It could be one of the following options:
- i) A local directory path of a turbomind model which is
converted by `lmdeploy convert` command or download from
ii) and iii).
- ii) The model_id of a lmdeploy-quantized model hosted
inside a model repo on huggingface.co, such as
"InternLM/internlm-chat-20b-4bit",
"lmdeploy/llama2-chat-70b-4bit", etc.
- iii) The model_id of a model hosted inside a model repo
on huggingface.co, such as "internlm/internlm-chat-7b",
"Qwen/Qwen-7B-Chat ", "baichuan-inc/Baichuan2-7B-Chat"
and so on.
Returns:
str: the backend type.
"""
from lmdeploy.pytorch.supported_models import \
is_supported as is_supported_pytorch
pytorch_has, turbomind_has = False, False
try:
from lmdeploy.turbomind.supported_models import \
is_supported as is_supported_turbomind
turbomind_has = is_supported_turbomind(model_path)
except ImportError:
logger.warning(
'Lmdeploy with turbomind engine is not installed correctly. '
'You may need to install lmdeploy from pypi or build from source '
'for turbomind engine.')
pytorch_has = is_supported_pytorch(model_path)
if not (pytorch_has or turbomind_has):
logger.warning(f'{model_path} is not explicitly supported by lmdeploy.'
f' Try to run with lmdeploy pytorch engine.')
backend = 'turbomind' if turbomind_has else 'pytorch'
return backend
def autoget_backend_config(
model_path: str,
backend_config: Optional[Union[PytorchEngineConfig,
TurbomindEngineConfig]] = None
) -> Union[PytorchEngineConfig, TurbomindEngineConfig]:
"""Get backend config automatically.
Args:
model_path (str): The input model path.
backend_config (TurbomindEngineConfig | PytorchEngineConfig): The
input backend config. Default to None.
Returns:
(PytorchEngineConfig | TurbomindEngineConfig): The auto-determined
backend engine config.
"""
from dataclasses import asdict
backend = autoget_backend(model_path)
if backend == 'pytorch':
config = PytorchEngineConfig()
else:
config = TurbomindEngineConfig()
if backend_config is not None:
data = asdict(backend_config)
for k, v in data.items():
if v and hasattr(config, k):
setattr(config, k, v)
return config
def check_vl_llm(config: dict) -> bool:
"""check if the model is a vl model from model config."""
arch = config['architectures'][0]
if arch == 'LlavaLlamaForCausalLM':
return True
elif arch == 'QWenLMHeadModel' and 'visual' in config:
return True
return False
def get_task(model_path: str):
"""get pipeline type and pipeline class from model config."""
if os.path.exists(os.path.join(model_path, 'triton_models', 'weights')):
# workspace model
return 'llm', AsyncEngine
config = get_hf_config_content(model_path)
if check_vl_llm(config):
return 'vlm', VLAsyncEngine
# default task, pipeline_class
return 'llm', AsyncEngine
# Copyright (c) OpenMMLab. All rights reserved.
from .chat import SubCliChat
from .cli import CLI
from .lite import SubCliLite
from .serve import SubCliServe
def run():
"""The entry point of running LMDeploy CLI."""
CLI.add_parsers()
SubCliChat.add_parsers()
SubCliServe.add_parsers()
SubCliLite.add_parsers()
parser = CLI.parser
args = parser.parse_args()
if 'run' in dir(args):
args.run(args)
else:
try:
args.print_help()
except AttributeError:
command = args.command
if command == 'serve':
SubCliServe.parser.print_help()
elif command == 'lite':
SubCliLite.parser.print_help()
elif command == 'chat':
SubCliChat.parser.print_help()
else:
parser.print_help()
# Copyright (c) OpenMMLab. All rights reserved.
import argparse
from typing import List
class DefaultsAndTypesHelpFormatter(argparse.HelpFormatter):
"""Formatter to output default value and type in help information."""
def _get_help_string(self, action):
"""Add default and type info into help."""
help = action.help
if '%(default)' not in action.help:
if action.default is not argparse.SUPPRESS:
defaulting_nargs = [argparse.OPTIONAL, argparse.ZERO_OR_MORE]
if (action.option_strings or action.nargs
in defaulting_nargs) and 'default' not in help.lower():
help += '. Default: %(default)s'
if action.type:
help += '. Type: %(type)s'
return help
def convert_args(args):
"""Convert args to dict format."""
special_names = ['run', 'command']
kwargs = {
k[0]: k[1]
for k in args._get_kwargs() if k[0] not in special_names
}
return kwargs
def get_lora_adapters(adapters: List[str]):
"""Parse lora adapers from cli input.
Args:
adapters (List[str]): CLI input string of lora adapter path(s).
Returns:
Dict[str,str] or None: Parsed lora adapter path(s).
"""
if not adapters:
return None
n = len(adapters)
output = {}
if n == 1:
name = 'default'
path = adapters[0].strip()
if '=' in path:
name, path = path.split('=', 1)
output[name] = path
else:
for pair in adapters:
            assert '=' in pair, f'Multiple lora paths must be in the format of ' \
f'xxx=yyy. But given: {pair}'
name, path = pair.strip().split('=', 1)
assert name not in output, f'Multiple lora paths with ' \
f'repeated lora name: {name}'
output[name] = path
return output
class ArgumentHelper:
"""Helper class to add unified argument."""
@staticmethod
def model_name(parser):
"""Add argument model_name to parser."""
return parser.add_argument(
'--model-name',
type=str,
default=None,
help='The name of the to-be-deployed model, such as'
' llama-7b, llama-13b, vicuna-7b and etc. You '
'can run `lmdeploy list` to get the supported '
'model names')
@staticmethod
def model_format(parser, default: str = None):
return parser.add_argument(
'--model-format',
type=str,
default=default,
choices=['hf', 'llama', 'awq'],
help='The format of input model. `hf` meaning `hf_llama`, `llama` '
'meaning `meta_llama`, `awq` meaning the quantized model by awq')
@staticmethod
def tp(parser):
"""Add argument tp to parser."""
return parser.add_argument(
'--tp',
type=int,
default=1,
help='GPU number used in tensor parallelism. Should be 2^n')
@staticmethod
def session_id(parser):
"""Add argument session_id to parser."""
return parser.add_argument('--session-id',
type=int,
default=1,
help='The identical id of a session')
@staticmethod
def session_len(parser, default: int = None):
return parser.add_argument('--session-len',
type=int,
default=default,
help='The max session length of a sequence')
@staticmethod
def max_batch_size(parser):
"""Add argument max_batch_size to parser."""
return parser.add_argument('--max-batch-size',
type=int,
default=128,
help='Maximum batch size')
@staticmethod
def quant_policy(parser):
"""Add argument quant_policy to parser."""
return parser.add_argument('--quant-policy',
type=int,
default=0,
help='Whether to use kv int8')
@staticmethod
def rope_scaling_factor(parser):
"""Add argument rope_scaling_factor to parser."""
return parser.add_argument('--rope-scaling-factor',
type=float,
default=0.0,
help='Rope scaling factor')
@staticmethod
def use_logn_attn(parser):
"""Add argument use_logn_attn to parser."""
return parser.add_argument(
'--use-logn-attn',
action='store_true',
default=False,
help='Whether to use logn attention scaling')
@staticmethod
def block_size(parser):
"""Add argument block_size to parser."""
return parser.add_argument('--block-size',
type=int,
default=64,
help='The block size for paging cache')
@staticmethod
def top_p(parser):
"""Add argument top_p to parser."""
return parser.add_argument(
'--top-p',
type=float,
default=0.8,
help='An alternative to sampling with temperature,'
' called nucleus sampling, where the model '
'considers the results of the tokens with '
'top_p probability mass')
@staticmethod
def top_k(parser):
"""Add argument top_k to parser."""
return parser.add_argument(
'--top-k',
type=int,
default=1,
help='An alternative to sampling with temperature, '
'where the model considers the top_k tokens '
'with the highest probability')
@staticmethod
def temperature(parser, default: float = 0.8):
return parser.add_argument('-temp',
'--temperature',
type=float,
default=default,
help='Sampling temperature')
@staticmethod
def repetition_penalty(parser):
"""Add argument repetition_penalty to parser."""
return parser.add_argument('--repetition-penalty',
type=float,
default=1.0,
help='Parameter to penalize repetition')
@staticmethod
def cap(parser):
"""Add argument cap to parser."""
return parser.add_argument(
'--cap',
type=str,
default='chat',
choices=['completion', 'infilling', 'chat', 'python'],
help='The capability of a model. '
'Deprecated. Please use --chat-template instead')
@staticmethod
def log_level(parser):
"""Add argument log_level to parser."""
import logging
return parser.add_argument('--log-level',
type=str,
default='ERROR',
choices=list(logging._nameToLevel.keys()),
help='Set the log level')
@staticmethod
def api_keys(parser):
return parser.add_argument(
'--api-keys',
type=str,
nargs='*',
default=None,
help='Optional list of space separated API keys',
)
@staticmethod
def ssl(parser):
return parser.add_argument(
'--ssl',
action='store_true',
required=False,
default=False,
help='Enable SSL. Requires OS Environment variables'
" 'SSL_KEYFILE' and 'SSL_CERTFILE'",
)
@staticmethod
def backend(parser):
"""Add argument backend to parser."""
return parser.add_argument('--backend',
type=str,
default='turbomind',
choices=['pytorch', 'turbomind'],
help='Set the inference backend')
@staticmethod
def engine(parser):
"""Add argument engine to parser."""
return parser.add_argument('--engine',
type=str,
default='turbomind',
choices=['pytorch', 'turbomind'],
help='Set the inference backend')
@staticmethod
def stream_output(parser):
"""Add argument stream_output to parser."""
return parser.add_argument(
'--stream-output',
action='store_true',
help='Indicator for streaming output or not')
@staticmethod
def calib_dataset(parser):
"""Add argument calib_dataset to parser."""
return parser.add_argument('--calib-dataset',
type=str,
default='ptb',
help='The calibration dataset name')
@staticmethod
def calib_samples(parser):
"""Add argument calib_samples to parser."""
return parser.add_argument(
'--calib-samples',
type=int,
default=128,
help='The number of samples for calibration')
@staticmethod
def calib_seqlen(parser):
"""Add argument calib_seqlen to parser."""
return parser.add_argument('--calib-seqlen',
type=int,
default=2048,
help='The sequence length for calibration')
@staticmethod
def device(parser):
"""Add argument device to parser."""
return parser.add_argument('--device',
type=str,
default='cuda',
choices=['cuda', 'cpu'],
help='Device type of running')
@staticmethod
def meta_instruction(parser):
"""Add argument meta_instruction to parser."""
return parser.add_argument(
'--meta-instruction',
type=str,
default=None,
help='System prompt for ChatTemplateConfig. Deprecated. '
'Please use --chat-template instead')
@staticmethod
def chat_template(parser):
"""Add chat template config to parser."""
return parser.add_argument(
'--chat-template',
type=str,
default=None,
help=\
'A JSON file or string that specifies the chat template configuration. ' # noqa
'Please refer to https://lmdeploy.readthedocs.io/en/latest/advance/chat_template.html for the specification' # noqa
)
@staticmethod
def cache_max_entry_count(parser):
"""Add argument cache_max_entry_count to parser."""
return parser.add_argument(
'--cache-max-entry-count',
type=float,
default=0.8,
help='The percentage of gpu memory occupied by the k/v cache')
@staticmethod
def adapters(parser):
"""Add argument adapters to parser."""
return parser.add_argument(
'--adapters',
nargs='*',
type=str,
default=None,
help='Used to set path(s) of lora adapter(s). One can input '
'key-value pairs in xxx=yyy format for multiple lora '
'adapters. If only have one adapter, one can only input '
'the path of the adapter.')
@staticmethod
def work_dir(parser):
"""Add argument work_dir to parser."""
return parser.add_argument(
'--work-dir',
type=str,
default='./work_dir',
help='The working directory to save results')
# Copyright (c) OpenMMLab. All rights reserved.
# Copyright (c) OpenMMLab. All rights reserved.
"""Chat with torch models."""
# Copyright (c) OpenMMLab. All rights reserved.
import torch
class LoadNoInit:
"""Initialize model without parameter initialization."""
def __init__(self):
self.constant_ = torch.nn.init.constant_
self.zeros_ = torch.nn.init.zeros_
self.ones_ = torch.nn.init.ones_
self.uniform_ = torch.nn.init.uniform_
self.normal_ = torch.nn.init.normal_
self.kaiming_uniform_ = torch.nn.init.kaiming_uniform_
self.kaiming_normal_ = torch.nn.init.kaiming_normal_
def __enter__(self, *args, **kwargs):
"""Replace initializers with no-op."""
torch.nn.init.constant_ = lambda *args, **kwargs: None
torch.nn.init.zeros_ = lambda *args, **kwargs: None
torch.nn.init.ones_ = lambda *args, **kwargs: None
torch.nn.init.uniform_ = lambda *args, **kwargs: None
torch.nn.init.normal_ = lambda *args, **kwargs: None
torch.nn.init.kaiming_uniform_ = lambda *args, **kwargs: None
torch.nn.init.kaiming_normal_ = lambda *args, **kwargs: None
def __exit__(self, *args, **kwargs):
"""Recover."""
torch.nn.init.constant_ = self.constant_
torch.nn.init.zeros_ = self.zeros_
torch.nn.init.ones_ = self.ones_
torch.nn.init.uniform_ = self.uniform_
torch.nn.init.normal_ = self.normal_
torch.nn.init.kaiming_uniform_ = self.kaiming_uniform_
torch.nn.init.kaiming_normal_ = self.kaiming_normal_
# Copyright (c) OpenMMLab. All rights reserved.
import torch.nn as nn
from lmdeploy.utils import get_logger
from .base import BasicAdapter, BasicAdapterFast
from .internlm import InternLMAdapter
from .llama2 import Llama2Adapter
logger = get_logger(__name__)
def _get_default_adapter(tokenizer):
if tokenizer.is_fast:
return BasicAdapterFast
else:
return BasicAdapter
def init_adapter(model: nn.Module, tokenizer, adapter=None):
if adapter is None:
for v in model.modules():
if 'InternLMModel' in v.__class__.__name__:
Adapter = InternLMAdapter
break
elif 'LlamaModel' in v.__class__.__name__:
Adapter = Llama2Adapter
break
else:
Adapter = _get_default_adapter(tokenizer)
elif adapter == 'llama1':
Adapter = _get_default_adapter(tokenizer)
else:
raise ValueError(f'Adapter {adapter} is not allowed.')
logger.info(f'Using adapter {Adapter.__name__}')
return Adapter(tokenizer)
# Copyright (c) OpenMMLab. All rights reserved.
"""Basic adapter suitable for general HuggingFace models."""
import re
from transformers import (PreTrainedTokenizer, PreTrainedTokenizerBase,
PreTrainedTokenizerFast)
from lmdeploy.utils import get_logger
logger = get_logger(__name__)
class BaseAdapter:
"""Base class for all adapters.
Note:
Adapters coordinate with the session manager to prepare input_ids.
The full sequence fed to the model is as follows:
```
adapter.start_ids
adapter.encode_and_decorate(user_input_1)
output_1_generated_by_model
adapter.sep_ids
adapter.encode_and_decorate(user_input_2)
output_2_generated_by_model
adapter.sep_ids
adapter.encode_and_decorate(user_input_3)
```
Thus adapter is responsible for providing model specific
``start_ids``, ``sep_ids``, and method to encode single prompt.
"""
def __init__(self, tokenizer: PreTrainedTokenizerBase):
self.tokenizer = tokenizer
def encode_and_decorate(self, prompt, add_special_tokens=False):
"""Model specific method to encode and decorate prompt."""
raise NotImplementedError
def decode(self, value):
"""Model specific method to decode single value to string."""
raise NotImplementedError
@property
def stopping_criteria(self):
"""Model specific stopping criteria for generation."""
return None
@property
def start_ids(self):
"""Model specific start ids."""
return [self.tokenizer.bos_token_id]
@property
def sep_ids(self):
"""Model specific separation ids."""
return [self.tokenizer.bos_token_id]
class BasicAdapter(BaseAdapter):
"""Basic adapter for slow tokenizers."""
def encode_and_decorate(self, prompt, add_special_tokens=False):
"""Encode prompt.
Note:
we leave <bos> to session manager to add.
"""
input_ids = self.tokenizer.encode(
prompt,
add_special_tokens=add_special_tokens,
return_tensors='pt',
)
logger.debug(f'Encode {prompt} to {input_ids}')
return input_ids
def decode(self, value):
"""Fallback when tokenizer is not fast."""
self.tokenizer: PreTrainedTokenizer
tok = self.tokenizer.decode(value)
return tok + ' '
class BasicAdapterFast(BaseAdapter):
"""Basic adapter for slow tokenizers."""
hex_regex = re.compile(r'^<0x([0-9ABCDEF]+)>$')
def encode_and_decorate(self, prompt, add_special_tokens=False):
"""Encode prompt.
Note:
we leave <bos> to session manager to add.
"""
input_ids = self.tokenizer.encode(
prompt,
add_special_tokens=add_special_tokens,
return_tensors='pt',
)
logger.debug(f'Encode {prompt} to {input_ids}')
return input_ids
def decode(self, value):
"""Decode with fast tokenizers."""
self.tokenizer: PreTrainedTokenizerFast
tok = self.tokenizer._convert_id_to_token(value)
if tok.startswith('▁'): # sentencepiece
space = ' '
tok = tok[1:]
else:
space = ''
if res := self.hex_regex.match(tok):
tok = chr(int(res.group(1), 16))
if tok == '</s>' or tok == '\r':
tok = '\n'
tok = space + tok
logger.debug(f'Decode {value} to {repr(tok)}')
return tok
# Copyright (c) OpenMMLab. All rights reserved.
import re
import torch
from transformers import (PreTrainedTokenizerFast, StoppingCriteria,
StoppingCriteriaList)
from lmdeploy.utils import get_logger
from .base import BaseAdapter
logger = get_logger(__name__)
class InternLMStoppingCriteria(StoppingCriteria):
"""Stopping criteria for HF version of InternLM."""
def __call__(self, input_ids, *args, **kwargs) -> bool:
return input_ids[0, -1] in [2, 103028]
class InternLMAdapter(BaseAdapter):
"""Adapter for InternLM.
    InternLM uses the following template, where \n must be encoded as token id 13.
<bos> (no actual newline here, just for better readability)
<|User|>:{prompt}<eoh>\n
<|Bot|>:{model_output}<eoa>\n
<|User|>:{prompt}<eoh>\n
<|Bot|>:{model_output}<eoa>\n
...
<eos>
"""
hex_regex = re.compile(r'^<0x([0-9ABCDEF]+)>$')
# ids of '<|User|>:'
B_USER_ID = torch.tensor([[333, 352, 1621, 352, 27232]])
# ids of '<eoh>\n<|Bot|>:'
E_USER_ID = torch.tensor([[103027, 13, 333, 352, 23845, 352, 27232]])
# ids of '<bos>'
start_ids = [1]
# ids of '\n'
sep_ids = [13]
def __init__(self, tokenizer: PreTrainedTokenizerFast):
self.tokenizer = tokenizer
def encode_and_decorate(self, prompt):
r"""Encode prompt and decorate with template.
Note:
we leave <bos> and chat history for session manager to add,
so we will decorate input_ids to '<|User|>:{prompt}<eoh>\n<|Bot|>:'
"""
input_ids = self.tokenizer.encode(
prompt,
add_special_tokens=False,
return_tensors='pt',
)
# This is f'<|User|>:{prompt}<eoh>\n<|Bot|>:'
# but force \n to 13 instead of 364
input_ids = torch.cat([self.B_USER_ID, input_ids, self.E_USER_ID],
dim=1)
return input_ids
def decode(self, value):
"""Decode generated tokens for InternLM."""
tok = self.tokenizer.decode(value)
if res := self.hex_regex.match(tok):
tok = chr(int(res.group(1), 16))
if tok == '</s>' or tok == '<eoa>' or tok == '\r':
tok = '\n'
logger.debug(f'Decode {value} to {repr(tok)}')
return tok
@property
def stopping_criteria(self):
return StoppingCriteriaList([InternLMStoppingCriteria()])