Unverified Commit 7b470f07 authored by lvhan028, committed by GitHub

Refactor the chat template of supported models using factory pattern (#144)

* refactor model.py and support baichuan-7b

* remove model_name

* remove hard session_len

* export tokenizer.py to target dir

* remove model_name from client

* remove model_name

* update

* correct throughput equation

* fix session.response

* update serving.md

* update readme

* update according to review comments

* update

* update

* update

* update
parent 2067862d
......@@ -54,7 +54,7 @@ The throughput of TurboMind exceeds 2000 tokens/s, which is about 5% - 15% highe
Below are quick steps for installation:
```shell
conda create -n lmdeploy python=3.10
conda create -n lmdeploy python=3.10 -y
conda activate lmdeploy
git clone https://github.com/InternLM/lmdeploy.git
cd lmdeploy
......@@ -77,7 +77,7 @@ git clone https://huggingface.co/internlm/internlm-chat-7b /path/to/internlm-cha
GIT_LFS_SKIP_SMUDGE=1
# 2. Convert InternLM model to turbomind's format, which will be in "./workspace" by default
python3 -m lmdeploy.serve.turbomind.deploy internlm-7b /path/to/internlm-chat-7b hf
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b
```
......@@ -85,11 +85,11 @@ python3 -m lmdeploy.serve.turbomind.deploy internlm-7b /path/to/internlm-chat-7b
```shell
docker run --gpus all --rm -v $(pwd)/workspace:/workspace -it openmmlab/lmdeploy:latest \
python3 -m lmdeploy.turbomind.chat internlm /workspace
python3 -m lmdeploy.turbomind.chat /workspace
```
```{note}
When performing FP16 inference with TurboMind, the InternLM-7B model requires at least 22.7 GB of GPU memory. NVIDIA cards such as the 3090, V100, and A100 are recommended.
When performing FP16 inference with TurboMind, the InternLM-7B model requires at least 15.7 GB of GPU memory. NVIDIA cards such as the 3090, V100, and A100 are recommended.
```
#### Serving
......@@ -103,7 +103,7 @@ bash workspace/service_docker_up.sh
Then, you can communicate with the inference server via the command line,
```shell
python3 -m lmdeploy.serve.client {server_ip_address}:33337 internlm
python3 -m lmdeploy.serve.client {server_ip_address}:33337
```
or via the web UI,
......@@ -114,7 +114,7 @@ python3 -m lmdeploy.app {server_ip_addresss}:33337 internlm
![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)
For the deployment of other supported models, such as LLaMA and vicuna, you can find the guide [here](docs/en/serving.md)
For the deployment of other supported models, such as LLaMA, LLaMA-2, vicuna and so on, you can find the guide [here](docs/en/serving.md)
### Inference with PyTorch
......
......@@ -53,7 +53,7 @@ TurboMind 的吞吐量超过 2000 token/s, 整体比 DeepSpeed 提升约 5% - 15
### Installation
```shell
conda create -n lmdeploy python=3.10
conda create -n lmdeploy python=3.10 -y
conda activate lmdeploy
git clone https://github.com/InternLM/lmdeploy.git
cd lmdeploy
......@@ -76,7 +76,7 @@ git clone https://huggingface.co/internlm/internlm-chat-7b /path/to/internlm-cha
GIT_LFS_SKIP_SMUDGE=1
# 2. Convert the model to turbomind's format. The default output path is ./workspace
python3 -m lmdeploy.serve.turbomind.deploy internlm-7b /path/to/internlm-chat-7b hf
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b
```
......@@ -84,11 +84,11 @@ python3 -m lmdeploy.serve.turbomind.deploy internlm-7b /path/to/internlm-chat-7b
```shell
docker run --gpus all --rm -v $(pwd)/workspace:/workspace -it openmmlab/lmdeploy:latest \
python3 -m lmdeploy.turbomind.chat internlm /workspace
python3 -m lmdeploy.turbomind.chat /workspace
```
```{note}
When performing FP16 inference with TurboMind, the InternLM-7B model requires at least 22.7 GB of GPU memory. NVIDIA cards such as the 3090, V100, and A100 are recommended.
When performing FP16 inference with TurboMind, the InternLM-7B model requires at least 15.7 GB of GPU memory. NVIDIA cards such as the 3090, V100, and A100 are recommended.
```
#### Serving
......@@ -102,18 +102,18 @@ bash workspace/service_docker_up.sh
You can chat with the inference server via the command line:
```shell
python3 -m lmdeploy.serve.client {server_ip_address}:33337 internlm
python3 -m lmdeploy.serve.client {server_ip_address}:33337
```
or chat via the WebUI:
```shell
python3 -m lmdeploy.app {server_ip_address}:33337 internlm
python3 -m lmdeploy.app {server_ip_address}:33337
```
![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)
For the deployment of other supported models, such as LLaMA and vicuna, please refer to the guide [here](docs/zh_cn/serving.md)
For the deployment of other supported models, such as LLaMA, LLaMA-2, vicuna and so on, please refer to the guide [here](docs/zh_cn/serving.md)
### Inference with PyTorch
......
......@@ -2,7 +2,7 @@
We provide several profiling tools to benchmark our models.
## profiling with dataset
## profile with dataset
Download the dataset below or create your own dataset.
......@@ -16,7 +16,6 @@ Profiling your model with `profile_throughput.py`
python profile_throughput.py \
ShareGPT_V3_unfiltered_cleaned_split.json \
/path/to/your/model \
${ModelType} \
--concurrency 64
```
......@@ -27,7 +26,6 @@ python profile_throughput.py \
```bash
python profile_generation.py \
/path/to/your/model \
${ModelType} \
--concurrency 8 --input_seqlen 0 --output_seqlen 2048
```
......@@ -36,10 +34,11 @@ python profile_generation.py \
The tools above profile models through the Python API. `profile_serving.py` benchmarks the serving performance instead.
```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
python profile_serving.py \
${TritonServerAddress} \
${ModelName} \
/path/to/tokenizer \
/path/to/dataset \
ShareGPT_V3_unfiltered_cleaned_split.json \
--concurrency 64
```
......@@ -7,7 +7,6 @@ from threading import Thread
import fire
import numpy as np
from lmdeploy.model import MODELS
from lmdeploy.turbomind import Tokenizer, TurboMind
......@@ -74,16 +73,13 @@ def warmup(model, concurrency: int, output_seqlen: int, warmup_round: int = 4):
def main(model_path: str,
model_name: str,
concurrency: int = 1,
input_seqlen: int = 0,
output_seqlen: int = 512,
test_round: int = 10):
tokenizer_model_path = osp.join(model_path, 'triton_models', 'tokenizer')
tokenizer = Tokenizer(tokenizer_model_path)
model = MODELS.get(model_name)()
stop_words = model.stop_words
tm_model = TurboMind(model_path=model_path, stop_words=stop_words)
tm_model = TurboMind(model_path=model_path)
warmup(tm_model, concurrency, output_seqlen)
......@@ -127,7 +123,8 @@ def main(model_path: str,
token_latency_min = np.min(stats[:, 2], axis=0)
token_latency_max = np.max(stats[:, 2], axis=0)
token_latency_ave = np.mean(stats[:, 2], axis=0)
throughput = np.sum(stats[:, 1], axis=0) / np.sum(stats[:, 2], axis=0)
throughput = np.sum(stats[:, 1], axis=0) / np.sum(stats[:, 2],
axis=0) * concurrency
print(f'\n{"-" * 50}\nconcurrency: {concurrency}, input_tokens: '
f'{input_seqlen}, output_tokens: {output_seqlen}\n'
f'elapsed_time: {elapsed_time:.2f}s\n'
......@@ -136,7 +133,7 @@ def main(model_path: str,
f'{first_token_latency_ave:.2f}s\ntoken latency(min, max, ave): '
f'{token_latency_min:.2f}s, {token_latency_max:.2f}s, '
f'{token_latency_ave:.2f}s\n'
f'throughput per threads: {throughput} token/s\n{"-" * 50}')
f'throughput: {throughput} token/s\n{"-" * 50}')
if __name__ == '__main__':
......
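The corrected throughput formula above sums generated tokens and token latencies across all threads and then multiplies by the concurrency, so the reported figure is the aggregate rate rather than a per-thread one. A small sketch with made-up numbers (illustrative only, not measurements from this PR; the `stats` columns are assumed to be `[first_token_latency, generated_tokens, token_latency]` as in `profile_generation.py`) shows the difference:

```python
import numpy as np

# Illustrative numbers only: 4 concurrent sessions, each producing 512 tokens
# with a cumulative token latency of 51.2 s.
concurrency = 4
stats = np.array([[0.09, 512.0, 51.2]] * concurrency)

per_thread = np.sum(stats[:, 1]) / np.sum(stats[:, 2])  # 10 token/s per thread
throughput = per_thread * concurrency                   # 40 token/s in total
print(f'throughput: {throughput:.1f} token/s')
```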
......@@ -53,29 +53,25 @@ def infer(chatbot, session_id: int, req_que: mp.Queue, res_que: mp.Queue):
def warmup(tritonserver_addr: str,
model_name: str,
concurrency: int,
session_len: int,
output_seqlen: int,
warmup_round: int = 4):
print('start to warmup ...')
def _infer(_chatbot, session_id):
for _ in range(warmup_round):
for _, _, _ in chatbot.stream_infer(
for _, _, _ in _chatbot.stream_infer(
session_id,
prompt='',
request_output_len=output_seqlen,
sequence_start=True,
sequence_end=True):
continue
chatbot.reset_session()
_chatbot.reset_session()
_start = time.perf_counter()
chatbots = [
Chatbot(tritonserver_addr=tritonserver_addr,
model_name=model_name,
session_len=session_len,
ignore_eos=True,
profile_generation=True) for _ in range(concurrency)
]
......@@ -90,8 +86,8 @@ def warmup(tritonserver_addr: str,
print(f'end warmup, elapsed time: {round(_end - _start, 2)} s')
def read_dataset(tritonserver_addr, tokenizer_path: str, dataset_path: str,
samples: int, test_round: int, session_len: int):
def read_dataset(tokenizer_path: str, dataset_path: str, samples: int,
test_round: int, session_len: int):
start = time.perf_counter()
with open(dataset_path) as f:
dataset = json.load(f)
......@@ -134,24 +130,20 @@ def read_dataset(tritonserver_addr, tokenizer_path: str, dataset_path: str,
def main(tritonserver_addr: str,
model_name: str,
tokenizer_path: str,
dataset_path: str,
concurrency: int = 1,
session_len: int = 2048,
samples: int = 2000,
samples: int = 1000,
test_round: int = 1):
warmup(tritonserver_addr, model_name, concurrency, session_len,
session_len)
req_que = read_dataset(tritonserver_addr, tokenizer_path, dataset_path,
samples, test_round, session_len)
warmup(tritonserver_addr, concurrency, session_len - 1)
req_que = read_dataset(tokenizer_path, dataset_path, samples, test_round,
session_len)
res_que = mp.Queue()
procs = []
_start = time.perf_counter()
for i in range(concurrency):
chatbot = Chatbot(tritonserver_addr=tritonserver_addr,
model_name=model_name,
session_len=session_len,
display=False,
profile_serving=True,
ignore_eos=True)
......
......@@ -8,7 +8,6 @@ from typing import List, Tuple
import fire
from lmdeploy.model import MODELS
from lmdeploy.turbomind import Tokenizer, TurboMind
......@@ -55,13 +54,11 @@ def sample_requests(
class Engine:
def __init__(self, model_path: str, model_name: str):
def __init__(self, model_path: str):
tokenizer_model_path = osp.join(model_path, 'triton_models',
'tokenizer')
tokenizer = Tokenizer(tokenizer_model_path)
model = MODELS.get(model_name)()
stop_words = model.stop_words
tm_model = TurboMind(model_path=model_path, stop_words=stop_words)
tm_model = TurboMind(model_path=model_path)
self.tm_model = tm_model
self.tokenizer = tokenizer
......@@ -119,11 +116,10 @@ class Engine:
def main(dataset: str,
model_path: str,
model_name: str,
concurrency: int = 1,
num_prompts: int = 1000):
engine = Engine(model_path, model_name)
engine = Engine(model_path)
tokenizer = engine.tokenizer
requests = sample_requests(dataset, num_prompts, tokenizer)
......
# Serving a model
## Serving [LLaMA-2](https://github.com/facebookresearch/llama)
You can download [llama-2 models from huggingface](https://huggingface.co/meta-llama) and serve them as shown below:
<details open>
<summary><b>7B</b></summary>
```shell
python3 -m lmdeploy.serve.turbomind.deploy llama2 /path/to/llama-2-7b-chat-hf
bash workspace/service_docker_up.sh
```
</details>
<details open>
<summary><b>13B</b></summary>
```shell
python3 -m lmdeploy.serve.turbomind.deploy llama2 /path/to/llama-2-13b-chat-hf --tp 2
bash workspace/service_docker_up.sh
```
</details>
<details open>
<summary><b>70B</b></summary>
```shell
python3 -m lmdeploy.serve.turbomind.deploy llama2 /path/to/llama-2-70b-chat-hf --tp 8
bash workspace/service_docker_up.sh
```
</details>
## Serving [LLaMA](https://github.com/facebookresearch/llama)
Weights for the LLaMA models can be obtained by filling out [this form](https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform)
......@@ -8,7 +42,7 @@ Weights for the LLaMA models can be obtained from by filling out [this form](htt
<summary><b>7B</b></summary>
```shell
python3 -m lmdeploy.serve.turbomind.deploy llama-7B /path/to/llama-7b llama \
python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-7b llama \
--tokenizer_path /path/to/tokenizer/model
bash workspace/service_docker_up.sh
```
......@@ -19,7 +53,7 @@ bash workspace/service_docker_up.sh
<summary><b>13B</b></summary>
```shell
python3 -m lmdeploy.serve.turbomind.deploy llama-13B /path/to/llama-13b llama \
python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-13b llama \
--tokenizer_path /path/to/tokenizer/model --tp 2
bash workspace/service_docker_up.sh
```
......@@ -30,7 +64,7 @@ bash workspace/service_docker_up.sh
<summary><b>30B</b></summary>
```shell
python3 -m lmdeploy.serve.turbomind.deploy llama-32B /path/to/llama-30b llama \
python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-30b llama \
--tokenizer_path /path/to/tokenizer/model --tp 4
bash workspace/service_docker_up.sh
```
......@@ -41,7 +75,7 @@ bash workspace/service_docker_up.sh
<summary><b>65B</b></summary>
```shell
python3 -m lmdeploy.serve.turbomind.deploy llama-65B /path/to/llama-65b llama \
python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-65b llama \
--tokenizer_path /path/to/tokenizer/model --tp 8
bash workspace/service_docker_up.sh
```
......@@ -60,7 +94,7 @@ python3 -m fastchat.model.apply_delta \
--target-model-path /path/to/vicuna-7b \
--delta-path lmsys/vicuna-7b-delta-v1.1
python3 -m lmdeploy.serve.turbomind.deploy vicuna-7B /path/to/vicuna-7b hf
python3 -m lmdeploy.serve.turbomind.deploy vicuna /path/to/vicuna-7b
bash workspace/service_docker_up.sh
```
......@@ -76,7 +110,7 @@ python3 -m fastchat.model.apply_delta \
--target-model-path /path/to/vicuna-13b \
--delta-path lmsys/vicuna-13b-delta-v1.1
python3 -m lmdeploy.serve.turbomind.deploy vicuna-13B /path/to/vicuna-13b hf
python3 -m lmdeploy.serve.turbomind.deploy vicuna /path/to/vicuna-13b
bash workspace/service_docker_up.sh
```
......
# Serving a model
## Serving [LLaMA-2](https://github.com/facebookresearch/llama)
Download the llama2 models from [here](https://huggingface.co/meta-llama) and deploy the service with the commands below:
<details open>
<summary><b>7B</b></summary>
```shell
python3 -m lmdeploy.serve.turbomind.deploy llama2 /path/to/llama-2-7b-chat-hf
bash workspace/service_docker_up.sh
```
</details>
<details open>
<summary><b>13B</b></summary>
```shell
python3 -m lmdeploy.serve.turbomind.deploy llama2 /path/to/llama-2-13b-chat-hf --tp 2
bash workspace/service_docker_up.sh
```
</details>
<details open>
<summary><b>70B</b></summary>
```shell
python3 -m lmdeploy.serve.turbomind.deploy llama2 /path/to/llama-2-70b-chat-hf --tp 8
bash workspace/service_docker_up.sh
```
</details>
## Serving [LLaMA](https://github.com/facebookresearch/llama)
Fill out [this form](https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform) to obtain the LLaMA model weights
......@@ -8,7 +42,7 @@
<summary><b>7B</b></summary>
```shell
python3 -m lmdeploy.serve.turbomind.deploy llama-7B /path/to/llama-7b llama \
python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-7b llama \
--tokenizer_path /path/to/tokenizer/model
bash workspace/service_docker_up.sh
```
......@@ -19,7 +53,7 @@ bash workspace/service_docker_up.sh
<summary><b>13B</b></summary>
```shell
python3 -m lmdeploy.serve.turbomind.deploy llama-13B /path/to/llama-13b llama \
python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-13b llama \
--tokenizer_path /path/to/tokenizer/model --tp 2
bash workspace/service_docker_up.sh
```
......@@ -30,7 +64,7 @@ bash workspace/service_docker_up.sh
<summary><b>30B</b></summary>
```shell
python3 -m lmdeploy.serve.turbomind.deploy llama-32B /path/to/llama-30b llama \
python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-30b llama \
--tokenizer_path /path/to/tokenizer/model --tp 4
bash workspace/service_docker_up.sh
```
......@@ -41,7 +75,7 @@ bash workspace/service_docker_up.sh
<summary><b>65B</b></summary>
```shell
python3 -m lmdeploy.serve.turbomind.deploy llama-65B /path/to/llama-65b llama \
python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-65b llama \
--tokenizer_path /path/to/tokenizer/model --tp 8
bash workspace/service_docker_up.sh
```
......@@ -60,7 +94,7 @@ python3 -m fastchat.model.apply_delta \
--target-model-path /path/to/vicuna-7b \
--delta-path lmsys/vicuna-7b-delta-v1.1
python3 -m lmdeploy.serve.turbomind.deploy vicuna-7B /path/to/vicuna-7b hf
python3 -m lmdeploy.serve.turbomind.deploy vicuna /path/to/vicuna-7b
bash workspace/service_docker_up.sh
```
......@@ -76,7 +110,7 @@ python3 -m fastchat.model.apply_delta \
--target-model-path /path/to/vicuna-13b \
--delta-path lmsys/vicuna-13b-delta-v1.1
python3 -m lmdeploy.serve.turbomind.deploy vicuna-13B /path/to/vicuna-13b hf
python3 -m lmdeploy.serve.turbomind.deploy vicuna /path/to/vicuna-13b
bash workspace/service_docker_up.sh
```
......
......@@ -101,28 +101,26 @@ def cancel_func(
def run(triton_server_addr: str,
model_name: str,
server_name: str = 'localhost',
server_port: int = 6006):
"""chat with AI assistant through web ui.
Args:
triton_server_addr (str): the communication address of inference server
model_name (str): the name of the deployed model
server_name (str): the ip address of gradio server
server_port (int): the port of gradio server
"""
with gr.Blocks(css=CSS, theme=THEME) as demo:
log_level = os.environ.get('SERVICE_LOG_LEVEL', 'INFO')
_chatbot = Chatbot(triton_server_addr,
log_level=log_level,
display=True)
model_name = _chatbot.model_name
chat_interface = partial(chat_stream, model_name=model_name)
reset_all = partial(reset_all_func,
model_name=model_name,
triton_server_addr=triton_server_addr)
log_level = os.environ.get('SERVICE_LOG_LEVEL', 'INFO')
llama_chatbot = gr.State(
Chatbot(triton_server_addr,
model_name,
log_level=log_level,
display=True))
llama_chatbot = gr.State(_chatbot)
state_chatbot = gr.State([])
with gr.Column(elem_id='container'):
......
......@@ -4,11 +4,43 @@ from mmengine import Registry
MODELS = Registry('model', locations=['lmdeploy.model'])
@MODELS.register_module(name='llama')
class BaseModel:
"""Base model."""
def __init__(self):
self.session_len = 2048
self.top_p = 0.8
self.top_k = None
self.temperature = 0.8
self.repetition_penalty = 1.0
@staticmethod
def get_prompt(prompt, sequence_start=True):
"""Return the prompt that is concatenated with other elements in the
chat template.
Args:
prompt (str): user's input prompt
sequence_start (bool): indicator for the first round chat of a
session sequence
Returns:
str: the concatenated prompt
"""
return prompt
@property
def stop_words(self):
"""Return the stop-words' token ids."""
return None
@MODELS.register_module(name='vicuna')
class Vicuna:
class Vicuna(BaseModel):
"""Chat template of vicuna model."""
def __init__(self):
super().__init__()
self.system = """A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. """ # noqa: E501
self.user = 'USER'
self.assistant = 'ASSISTANT'
......@@ -29,17 +61,20 @@ class Vicuna:
else:
return f'</s>{self.user}: {prompt} {self.assistant}:'
@property
def stop_words(self):
"""Return the stop-words' token ids."""
return None
@MODELS.register_module(name='internlm')
class InternLM:
class InternLM(BaseModel):
def __init__(self):
super().__init__()
@MODELS.register_module(name='internlm-chat-7b')
class InternLMChat7B(BaseModel):
"""Chat template of InternLM model."""
def __init__(self):
super().__init__()
self.system = ''
self.user = '<|User|>'
self.eoh = '<eoh>'
......@@ -70,38 +105,21 @@ class InternLM:
return [103027, 103028]
@MODELS.register_module(name='llama')
class Llama:
"""Chat template of LLaMA model."""
@MODELS.register_module(name='internlm-chat-7b-8k')
class InternLMChat7B8K(InternLMChat7B):
def __init__(self):
pass
def get_prompt(self, prompt, sequence_start=True):
"""Return the prompt that is concatenated with other elements in the
chat template.
Args:
prompt (str): user's input prompt
sequence_start (bool): indicator for the first round chat of a
session sequence
Returns:
str: the concatenated prompt
"""
return prompt
@property
def stop_words(self):
"""Return the stop-words' token ids."""
return None
super(InternLMChat7B8K, self).__init__()
self.session_len = 8192
@MODELS.register_module(name='puyu')
class Puyu:
class Puyu(BaseModel):
"""Chat template of puyu model.This is only for internal usage in Shanghai
AI Laboratory."""
def __init__(self):
super().__init__()
self.system = """meta instruction
You are an AI assistant whose name is InternLM (书生·浦语).
- 书生·浦语 is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.
......@@ -125,12 +143,20 @@ conversation""" # noqa: E501
return [45623]
@MODELS.register_module(name='baichuan-7b')
class Baichuan7B(BaseModel):
def __init__(self):
super().__init__()
self.repetition_penalty = 1.1
@MODELS.register_module(name='llama2')
class Llama2:
class Llama2(BaseModel):
"""Chat template of LLaMA2 model."""
def __init__(self):
super().__init__()
B_INST, E_INST = '[INST]', '[/INST]'
B_SYS, E_SYS = '<<SYS>>\n', '\n<</SYS>>\n\n'
......@@ -144,6 +170,7 @@ If a question does not make any sense, or is not factually coherent, explain why
self.b_sys = B_SYS
self.e_sys = E_SYS
self.default_sys_prompt = DEFAULT_SYSTEM_PROMPT
self.session_len = 4096
def get_prompt(self, prompt, sequence_start=True):
"""Return the prompt that is concatenated with other elements in the
......@@ -163,19 +190,15 @@ If a question does not make any sense, or is not factually coherent, explain why
return f'{self.b_inst} {prompt} {self.e_inst} '
@property
def stop_words(self):
"""Return the stop-words' token ids."""
return None
def main(model_name: str = 'test'):
assert model_name in MODELS.module_dict.keys(), \
f"'{model_name}' is not supported. " \
f'The supported models are: {MODELS.module_dict.keys()}'
model = MODELS.get('vicuna--1')()
model = MODELS.get(model_name)()
prompt = model.get_prompt(prompt='hi')
print(prompt)
print(f'session_len: {model.session_len}')
if __name__ == '__main__':
......
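To make the factory pattern concrete, here is a minimal, hypothetical sketch of how an additional chat template could plug into the `MODELS` registry introduced above. The name `my-chat` and its role markers are invented for illustration and are not part of this PR:

```python
from lmdeploy.model import MODELS, BaseModel


@MODELS.register_module(name='my-chat')
class MyChat(BaseModel):
    """Hypothetical chat template registered through the factory."""

    def __init__(self):
        super().__init__()
        self.session_len = 4096  # override BaseModel defaults as needed

    def get_prompt(self, prompt, sequence_start=True):
        # Wrap the user input with this template's own role markers.
        return f'<user>{prompt}<assistant>'


# Downstream code resolves templates by name, as deploy.py and chat.py do:
model = MODELS.get('my-chat')()
print(model.get_prompt('hi'), model.session_len)
```

Because `deploy.py` now records the registered name in the converted workspace, the same name-based lookup works both for local TurboMind inference and behind the Triton server.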
......@@ -13,7 +13,7 @@ def input_prompt():
return '\n'.join(iter(input, sentinel))
def main(tritonserver_addr: str, model_name: str, session_id: int = 1):
def main(tritonserver_addr: str, session_id: int = 1):
"""An example to communicate with inference server through the command line
interface.
......@@ -24,10 +24,7 @@ def main(tritonserver_addr: str, model_name: str, session_id: int = 1):
session_id (int): the identical id of a session
"""
log_level = os.environ.get('SERVICE_LOG_LEVEL', 'WARNING')
chatbot = Chatbot(tritonserver_addr,
model_name,
log_level=log_level,
display=True)
chatbot = Chatbot(tritonserver_addr, log_level=log_level, display=True)
nth_round = 1
while True:
prompt = input_prompt()
......
......@@ -64,15 +64,6 @@ class Chatbot:
tritonserver_addr (str): communicating address '<ip>:<port>' of
triton inference server
model_name (str): name of the to-be-deployed model
session_len (int): the maximum context length of the model
top_p (float): If set to float < 1, only the smallest set of most
probable tokens with probabilities that add up to top_p or higher
are kept for generation.
top_k (int): The number of the highest probability vocabulary tokens to
keep for top-k-filtering
temperature (float): to modulate the next token probability
repetition_penalty (float): The parameter for repetition penalty.
1.0 means no penalty
log_level (int): the level of the log
display (bool): display the generated text on the console or not
profile_generation (bool): profile token generation or not
......@@ -80,24 +71,18 @@ class Chatbot:
def __init__(self,
tritonserver_addr: str,
model_name: str,
session_len: int = 2048,
top_p: float = 0.8,
top_k: int = None,
temperature: float = 0.8,
repetition_penalty: float = 1.0,
ignore_eos: bool = False,
log_level: int = logging.INFO,
display: bool = False,
profile_generation: bool = False,
profile_serving: bool = False):
assert model_name in MODELS.module_dict.keys(), \
f"'{model_name}' is not supported. " \
self.tritonserver_addr = tritonserver_addr
self.model_name = self._get_model_name()
assert self.model_name in MODELS.module_dict.keys(), \
f"'{self.model_name}' is not supported. " \
f'The supported models are: {MODELS.module_dict.keys()}'
self.model_name = model_name
self.model = MODELS.get(self.model_name)()
self._session = None
self.tritonserver_addr = tritonserver_addr
self.preprocess = Preprocessor(tritonserver_addr)
self.postprocess = Postprocessor(tritonserver_addr)
self.bos_id = self._get_bos()
......@@ -108,11 +93,11 @@ class Chatbot:
stop_words = None
bad_words = np.array([[[self.eos_id], [1]]], dtype=np.int32)
self.cfg = mmengine.Config(
dict(session_len=session_len,
top_p=top_p,
top_k=top_k,
temperature=temperature,
repetition_penalty=repetition_penalty,
dict(session_len=self.model.session_len,
top_p=self.model.top_p,
top_k=self.model.top_k,
temperature=self.model.temperature,
repetition_penalty=self.model.repetition_penalty,
stop_words=stop_words,
bad_words=bad_words))
self.log_level = log_level
......@@ -167,12 +152,16 @@ class Chatbot:
request_output_len,
sequence_start,
sequence_end):
yield status, res, tokens
if status.value < 0:
return
self._session.histories = \
self._session.histories + self._session.prompt + \
self._session.response
break
else:
yield status, res, tokens
if status.value == 0:
self._session.histories = \
self._session.histories + self._session.prompt + \
self._session.response
else:
yield status, res, tokens
def end(self, session_id: int, *args, **kwargs):
"""end a session. Triton inference server will release the session's
......@@ -208,11 +197,11 @@ class Chatbot:
request_output_len=0,
sequence_start=False,
sequence_end=True):
if status != StatusCode.TRITON_STREAM_END:
return status
if status.value < 0:
break
self.reset_session()
return StatusCode.TRITON_STREAM_END
return status
def cancel(self, session_id: int, *args, **kwargs):
"""Cancel the session during generating tokens.
......@@ -243,6 +232,7 @@ class Chatbot:
return StatusCode.TRITON_SESSION_CLOSED
prev_session = self._session
status, res = None, None
for status, res, _ in self._stream_infer(self._session,
prompt='',
request_output_len=0,
......@@ -254,7 +244,7 @@ class Chatbot:
if status == StatusCode.TRITON_STREAM_END:
logger.info(f'cancel session {session_id} successfully')
if prev_session.histories:
logger.warn(f'TODO: start to recover session {session_id}')
logger.warning(f'TODO: start to recover session {session_id}')
else:
logger.info(f'cancel session {session_id} failed: {res}')
return status
......@@ -295,7 +285,7 @@ class Chatbot:
sequence_start=True,
sequence_end=False):
if status.value < 0:
return status
break
self._session.histories = histories
return status
......@@ -314,6 +304,14 @@ class Chatbot:
"""set session."""
self._session = value
def _get_model_name(self):
with grpcclient.InferenceServerClient(
self.tritonserver_addr) as client:
model_config = client.get_model_config(model_name='turbomind',
as_json=True)
return model_config['config']['parameters']['model_name'][
'string_value']
def _get_bos(self):
"""return bos token id."""
token_ids, _ = self.preprocess('<BOS>')
......@@ -422,16 +420,12 @@ class Chatbot:
request_output_len, sequence_start,
sequence_end, preseq_length, cancel))
producer.start()
for state, res, tokens in self.stream_consumer(self.postprocess, que,
session, input_tokens,
preseq_length, cancel,
logger, self.display,
self.profile_generation,
self.eos_id):
if state.value < 0:
yield state, res, 0
else:
yield state, res, tokens
for status, res, n_token in self.stream_consumer(
self.postprocess, que, session, input_tokens, preseq_length,
cancel, logger, self.display, self.profile_generation,
self.eos_id):
yield status, res, n_token
producer.join()
self._session = que.get()
curseq_length = self._session.sequence_length
......@@ -543,11 +537,13 @@ class Chatbot:
tuple: status, text, generated token number
"""
offset = n_input_token + preseq_length
status, res, n_token = None, '', 0
while True:
result = res_queue.get()
if result is None:
yield (StatusCode.TRITON_STREAM_END, session.response,
session.sequence_length - offset)
status = StatusCode.TRITON_STREAM_END
res = session.response
n_token = session.sequence_length - offset
session.status = StatusCode.TRITON_STREAM_END
break
if 'errcode' in result:
......@@ -555,7 +551,10 @@ class Chatbot:
f"{result['errcode']}, {result['errmsg']}, "
f'token {session.sequence_length}')
session.sequence_length = preseq_length
yield result['errcode'], result['errmsg'], 0
session.response = ''
status = StatusCode.TRITON_SERVER_ERR
res = f"{result['errcode']}, {result['errmsg']}"
n_token = 0
break
if cancel:
continue
......@@ -601,3 +600,4 @@ class Chatbot:
res_queue.put(session)
if display:
print('\n')
yield status, res, n_token
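As a usage sketch of the updated interface (the import path and the address `0.0.0.0:33337` are placeholders based on the repository layout, not prescribed by this PR), a caller now constructs `Chatbot` without a model name and consumes the `(status, text, n_token)` triples yielded by `stream_infer`:

```python
from lmdeploy.serve.turbomind.chatbot import Chatbot, StatusCode

# The chat template name is now fetched from the Triton server, so only the
# server address is required here.
chatbot = Chatbot('0.0.0.0:33337', display=False)
for status, text, n_token in chatbot.stream_infer(session_id=1,
                                                  prompt='hi',
                                                  request_output_len=64,
                                                  sequence_start=True,
                                                  sequence_end=True):
    if status.value < 0:  # server-side error: stop consuming the stream
        break
if status == StatusCode.TRITON_STREAM_END:
    print(f'generated {n_token} tokens: {text}')
```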
......@@ -12,9 +12,16 @@ import safetensors
import torch
from sentencepiece import SentencePieceProcessor
from lmdeploy.model import MODELS
supported_formats = ['llama', 'hf']
def get_package_root_path():
import importlib.resources as pkg_resources
return pkg_resources.path('lmdeploy', '')
def create_workspace(_path: str):
"""Create a workspace.
......@@ -164,6 +171,7 @@ def export(model_name: str,
save_bin(param_data, param_name)
# export config and save it to {out_dir}/config.ini
model = MODELS.get(model_name)()
vocab_size, bos_id, eos_id = tokenizer_info(tokenizer_path)
assert _vocab_size >= vocab_size, \
f'different vocab size {_vocab_size} vs {vocab_size}'
......@@ -184,7 +192,7 @@ def export(model_name: str,
# parameters for turbomind
max_batch_size=32,
max_context_token_num=4,
session_len=2056,
session_len=model.session_len + 8,
step_length=1,
cache_max_entry_count=48,
cache_chunk_size=1,
......@@ -226,6 +234,9 @@ def deploy_llama(model_name: str, model_path: str, tokenizer_path: str,
if osp.exists(tokenizer_path):
shutil.copy(tokenizer_path,
osp.join(triton_models_path, 'tokenizer/tokenizer.model'))
with get_package_root_path() as root_path:
shutil.copy(osp.join(root_path, 'turbomind/tokenizer.py'),
osp.join(triton_models_path, 'tokenizer'))
else:
print(f'tokenizer model {tokenizer_path} does not exist')
return False
......@@ -352,6 +363,9 @@ def deploy_hf(model_name: str, model_path: str, tokenizer_path: str,
json_path = osp.join(model_path, _file)
shutil.copy(json_path,
osp.join(triton_models_path, 'tokenizer', _file))
with get_package_root_path() as root_path:
shutil.copy(osp.join(root_path, 'turbomind/tokenizer.py'),
osp.join(triton_models_path, 'tokenizer'))
else:
print(f'tokenizer model {tokenizer_path} does not exist')
exit(-1)
......@@ -495,7 +509,7 @@ def pack_model_repository(workspace_path: str):
def main(model_name: str,
model_path: str,
model_format: str,
model_format: str = 'hf',
tokenizer_path: str = None,
dst_path: str = './workspace',
tp: int = 1):
......@@ -511,6 +525,9 @@ def main(model_name: str,
dst_path (str): the destination path that saves outputs
tp (int): the number of GPUs used for tensor parallelism
"""
assert model_name in MODELS.module_dict.keys(), \
f"'{model_name}' is not supported. " \
f'The supported models are: {MODELS.module_dict.keys()}'
if model_format not in supported_formats:
print(f'the model format "{model_format}" is not supported. '
......@@ -539,8 +556,11 @@ def main(model_name: str,
# update `tensor_para_size` in `triton_models/interactive/config.pbtxt`
with open(osp.join(triton_models_path, 'interactive/config.pbtxt'),
'a') as f:
param = 'parameters {\n key: "tensor_para_size"\n value: {\n ' \
'string_value: ' + f'"{tp}"\n' + ' }\n}\n'
param = \
'parameters {\n key: "tensor_para_size"\n value: {\n ' \
'string_value: ' + f'"{tp}"\n' + ' }\n}\n' + \
'parameters {\n key: "model_name"\n value: {\n ' \
'string_value: ' + f'"{model_name}"\n' + ' }\n}\n'
f.write(param)
if not res:
print(f'deploy model "{model_name}" via turbomind failed')
......
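For reference, with `--tp 2` and a model named `internlm-chat-7b` (example values only), the block appended to `triton_models/interactive/config.pbtxt` by the code above would look roughly like this; the second `parameters` entry is the new one that carries the chat-template name (indentation approximated):

```
parameters {
  key: "tensor_para_size"
  value: {
    string_value: "2"
  }
}
parameters {
  key: "model_name"
  value: {
    string_value: "internlm-chat-7b"
  }
}
```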
......@@ -2,88 +2,15 @@
import json
import os.path as osp
from pathlib import Path
from typing import List
import numpy as np
import triton_python_backend_utils as pb_utils
class Tokenizer:
"""Tokenize prompts or de-tokenize tokens into texts.
Args:
model_file (str): the path of the tokenizer model
"""
def __init__(self, model_file: str):
model_folder = osp.split(model_file)[0]
tokenizer_config_file = osp.join(model_folder, 'tokenizer_config.json')
use_hf_model = osp.exists(tokenizer_config_file)
self.use_hf_model = use_hf_model
if not self.use_hf_model:
from sentencepiece import SentencePieceProcessor
self.model = SentencePieceProcessor(model_file=model_file)
self.vocab_size = self.model.vocab_size()
self.start_id = self.model.bos_id()
self.end_id = self.model.eos_id()
else:
from transformers import AutoTokenizer
backend_tokenizer_file = osp.join(model_folder, 'tokenizer.json')
if not osp.exists(backend_tokenizer_file):
print('WARNING: Can not find tokenizer.json. '
'It may take long time to initialize the tokenizer.')
self.model = AutoTokenizer.from_pretrained(model_folder,
trust_remote_code=True)
self.vocab_size = self.model.vocab_size
self.start_id = self.model.bos_token_id
self.end_id = self.model.eos_token_id
# save tokenizer.json to reuse
if not osp.exists(backend_tokenizer_file) and \
hasattr(self.model, 'backend_tokenizer'):
self.model.backend_tokenizer.save(backend_tokenizer_file)
def encode(self, s: str):
"""Tokenize a prompt.
Args:
s (str): a prompt
Returns:
list[int]: token ids
"""
if not self.use_hf_model:
add_bos = False
add_eos = False
if s.find('<BOS>') != -1:
s = s.replace('<BOS>', '')
add_bos = True
if s == '<EOS>':
s = ''
add_eos = True
return self.model.Encode(s, add_bos=add_bos, add_eos=add_eos)
else:
add_special_tokens = False
if s.find('<BOS>') != -1:
s = s.replace('<BOS>', '<s>')
if s == '<EOS>':
s = '</s>'
if len(s) == 0:
add_special_tokens = True
return self.model.encode(s, add_special_tokens=add_special_tokens)
def decode(self, t: List[int]):
"""De-tokenize.
Args:
t (List[int]): a list of token ids
Returns:
str: text of decoding tokens
"""
if not self.use_hf_model:
return self.model.Decode(t)
else:
skip_special_tokens = False
return self.model.decode(t,
skip_special_tokens=skip_special_tokens)
# This tokenizer is `lmdeploy/turbomind/tokenizer.py`. When an LLM is served
# by the Triton Inference Server, the model has to be converted first by running
# `python lmdeploy/serve/turbomind/deploy.py`. Then
# `lmdeploy/turbomind/tokenizer.py` is copied to `tokenizer/tokenizer.py`.
from .tokenizer.tokenizer import Tokenizer
class TritonPythonModel:
......
......@@ -2,90 +2,17 @@
import json
import os.path as osp
from pathlib import Path
from typing import List
import numpy as np
import torch
import triton_python_backend_utils as pb_utils
from torch.nn.utils.rnn import pad_sequence
class Tokenizer:
"""Tokenize prompts or de-tokenize tokens into texts.
Args:
model_file (str): the path of the tokenizer model
"""
def __init__(self, model_file: str):
model_folder = osp.split(model_file)[0]
tokenizer_config_file = osp.join(model_folder, 'tokenizer_config.json')
use_hf_model = osp.exists(tokenizer_config_file)
self.use_hf_model = use_hf_model
if not self.use_hf_model:
from sentencepiece import SentencePieceProcessor
self.model = SentencePieceProcessor(model_file=model_file)
self.vocab_size = self.model.vocab_size()
self.start_id = self.model.bos_id()
self.end_id = self.model.eos_id()
else:
from transformers import AutoTokenizer
backend_tokenizer_file = osp.join(model_folder, 'tokenizer.json')
if not osp.exists(backend_tokenizer_file):
print('WARNING: Can not find tokenizer.json. '
'It may take long time to initialize the tokenizer.')
self.model = AutoTokenizer.from_pretrained(model_folder,
trust_remote_code=True)
self.vocab_size = self.model.vocab_size
self.start_id = self.model.bos_token_id
self.end_id = self.model.eos_token_id
# save tokenizer.json to reuse
if not osp.exists(backend_tokenizer_file) and \
hasattr(self.model, 'backend_tokenizer'):
self.model.backend_tokenizer.save(backend_tokenizer_file)
def encode(self, s: str):
"""Tokenize a prompt.
Args:
s (str): a prompt
Returns:
list[int]: token ids
"""
if not self.use_hf_model:
add_bos = False
add_eos = False
if s.find('<BOS>') != -1:
s = s.replace('<BOS>', '')
add_bos = True
if s == '<EOS>':
s = ''
add_eos = True
return self.model.Encode(s, add_bos=add_bos, add_eos=add_eos)
else:
add_special_tokens = False
if s.find('<BOS>') != -1:
s = s.replace('<BOS>', '<s>')
if s == '<EOS>':
s = '</s>'
if len(s) == 0:
add_special_tokens = True
return self.model.encode(s, add_special_tokens=add_special_tokens)
def decode(self, t: List[int]):
"""De-tokenize.
Args:
t (List[int]): a list of token ids
Returns:
str: text of decoding tokens
"""
if not self.use_hf_model:
return self.model.Decode(t)
else:
skip_special_tokens = False
return self.model.decode(t,
skip_special_tokens=skip_special_tokens)
# This tokenizer is `lmdeploy/turbomind/tokenizer.py`. When an LLM is served
# by the Triton Inference Server, the model has to be converted first by running
# `python lmdeploy/serve/turbomind/deploy.py`. Then
# `lmdeploy/turbomind/tokenizer.py` is copied to `tokenizer/tokenizer.py`.
from .tokenizer.tokenizer import Tokenizer
class TritonPythonModel:
......@@ -131,8 +58,8 @@ class TritonPythonModel:
osp.join(
cur_folder, self.model_config['parameters']['tokenizer_path']
['string_value']))
self.start_id = self.tokenizer.start_id
self.end_id = self.tokenizer.end_id
self.start_id = self.tokenizer.bos_token_id
self.end_id = self.tokenizer.eos_token_id
def execute(self, requests):
"""`execute` must be implemented in every Python model. `execute`
......
......@@ -29,29 +29,24 @@ def valid_str(string, coding='utf-8'):
return ret
def main(model_name,
model_path,
session_id: int = 1,
repetition_penalty: float = 1.0):
def main(model_path, session_id: int = 1, repetition_penalty: float = 1.0):
"""An example to perform model inference through the command line
interface.
Args:
model_name (str): the name of the deployed model
model_path (str): the path of the deployed model
session_id (int): the identical id of a session
"""
model = MODELS.get(model_name)()
tokenizer_model_path = osp.join(model_path, 'triton_models', 'tokenizer')
tokenizer = Tokenizer(tokenizer_model_path)
tm_model = tm.TurboMind(model_path,
eos_id=tokenizer.eos_token_id,
stop_words=model.stop_words)
tm_model = tm.TurboMind(model_path, eos_id=tokenizer.eos_token_id)
generator = tm_model.create_instance()
nth_round = 1
step = 0
seed = random.getrandbits(64)
model_name = tm_model.model_name
model = MODELS.get(model_name)()
while True:
prompt = input_prompt()
......
......@@ -12,6 +12,7 @@ import torch
from torch.nn.utils.rnn import pad_sequence
import lmdeploy
from lmdeploy.model import MODELS
# TODO: find another way import _turbomind
lmdeploy_dir = osp.split(lmdeploy.__file__)[0]
......@@ -70,14 +71,12 @@ class TurboMind:
model_path (str): the path of turbomind's model
data_type (str): the data type
eos_id (int): eos token id
stop_words (List[int]): token ids of stop-words
"""
def __init__(self,
model_path: str,
data_type: str = 'fp16',
eos_id: int = 2,
stop_words: List[int] = None):
eos_id: int = 2):
self.eos_id = eos_id
# TODO: support mpi
......@@ -101,6 +100,9 @@ class TurboMind:
self.gpu_count = parser.getint(section_name,
'tensor_para_size')
self.session_len = parser.getint(section_name, 'session_len')
self.model_name = parser.get(section_name, 'model_name')
model = MODELS.get(self.model_name)()
self.stop_words = _stop_words(model.stop_words)
# params
self.node_id = node_id
......@@ -129,8 +131,6 @@ class TurboMind:
for t in threads:
t.join()
self.stop_words = _stop_words(stop_words)
def create_instance(self, cuda_stream_id=0):
"""Create a turbomind instance.
......
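As a quick usage sketch under these changes (paths are placeholders), callers no longer pass a model name or stop words; both are resolved from the converted workspace's `config.ini`:

```python
import os.path as osp

from lmdeploy.turbomind import Tokenizer, TurboMind

model_path = './workspace'  # produced by `lmdeploy.serve.turbomind.deploy`
tokenizer = Tokenizer(osp.join(model_path, 'triton_models', 'tokenizer'))

# model_name, session_len and stop words are read from config.ini internally.
tm_model = TurboMind(model_path=model_path, eos_id=tokenizer.eos_token_id)
print(tm_model.model_name, tm_model.session_len)
```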