Unverified commit 7b470f07, authored by lvhan028, committed by GitHub

Refactor the chat template of supported models using factory pattern (#144)

* refactor model.py and support baichuan-7b

* remove model_name

* remove hard session_len

* export tokenizer.py to target dir

* remove model_name from client

* remove model_name

* update

* correct throughput equation

* fix session.response

* update serving.md

* update readme

* update according to review comments

* update

* update

* update

* update
parent 2067862d
@@ -54,7 +54,7 @@ The throughput of TurboMind exceeds 2000 tokens/s, which is about 5% - 15% highe
 Below are quick steps for installation:
 ```shell
-conda create -n lmdeploy python=3.10
+conda create -n lmdeploy python=3.10 -y
 conda activate lmdeploy
 git clone https://github.com/InternLM/lmdeploy.git
 cd lmdeploy
@@ -77,7 +77,7 @@ git clone https://huggingface.co/internlm/internlm-chat-7b /path/to/internlm-cha
 GIT_LFS_SKIP_SMUDGE=1
 # 2. Convert InternLM model to turbomind's format, which will be in "./workspace" by default
-python3 -m lmdeploy.serve.turbomind.deploy internlm-7b /path/to/internlm-chat-7b hf
+python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b
 ```
@@ -85,11 +85,11 @@ python3 -m lmdeploy.serve.turbomind.deploy internlm-7b /path/to/internlm-chat-7b
 ```shell
 docker run --gpus all --rm -v $(pwd)/workspace:/workspace -it openmmlab/lmdeploy:latest \
-    python3 -m lmdeploy.turbomind.chat internlm /workspace
+    python3 -m lmdeploy.turbomind.chat /workspace
 ```
 ```{note}
-When inferring with FP16 precision, the InternLM-7B model requires at least 22.7G of GPU memory overhead on TurboMind. It is recommended to use NVIDIA cards such as 3090, V100, A100, etc.
+When inferring with FP16 precision, the InternLM-7B model requires at least 15.7G of GPU memory overhead on TurboMind. It is recommended to use NVIDIA cards such as 3090, V100, A100, etc.
 ```
 #### Serving
@@ -103,7 +103,7 @@ bash workspace/service_docker_up.sh
 Then, you can communicate with the inference server by command line,
 ```shell
-python3 -m lmdeploy.serve.client {server_ip_addresss}:33337 internlm
+python3 -m lmdeploy.serve.client {server_ip_addresss}:33337
 ```
 or webui,
@@ -114,7 +114,7 @@ python3 -m lmdeploy.app {server_ip_addresss}:33337 internlm
 ![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)
-For the deployment of other supported models, such as LLaMA, vicuna, you can find the guide from [here](docs/en/serving.md)
+For the deployment of other supported models, such as LLaMA, LLaMA-2, vicuna and so on, you can find the guide from [here](docs/en/serving.md)
 ### Inference with PyTorch
...
@@ -53,7 +53,7 @@ TurboMind's throughput exceeds 2000 token/s, about 5% - 15% higher than DeepSpeed overall
 ### Installation
 ```shell
-conda create -n lmdeploy python=3.10
+conda create -n lmdeploy python=3.10 -y
 conda activate lmdeploy
 git clone https://github.com/InternLM/lmdeploy.git
 cd lmdeploy
@@ -76,7 +76,7 @@ git clone https://huggingface.co/internlm/internlm-chat-7b /path/to/internlm-cha
 GIT_LFS_SKIP_SMUDGE=1
 # 2. Convert the model to turbomind's format. The default output path is ./workspace
-python3 -m lmdeploy.serve.turbomind.deploy internlm-7b /path/to/internlm-chat-7b hf
+python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b
 ```
@@ -84,11 +84,11 @@ python3 -m lmdeploy.serve.turbomind.deploy internlm-7b /path/to/internlm-chat-7b
 ```shell
 docker run --gpus all --rm -v $(pwd)/workspace:/workspace -it openmmlab/lmdeploy:latest \
-    python3 -m lmdeploy.turbomind.chat internlm /workspace
+    python3 -m lmdeploy.turbomind.chat /workspace
 ```
 ```{note}
-When inferring the InternLM-7B model with FP16 precision, turbomind needs at least 22.7G of GPU memory. NVIDIA cards such as 3090, V100 and A100 are recommended.
+When inferring the InternLM-7B model with FP16 precision, turbomind needs at least 15.7G of GPU memory. NVIDIA cards such as 3090, V100 and A100 are recommended.
 ```
 #### Serving
@@ -102,18 +102,18 @@ bash workspace/service_docker_up.sh
 You can chat with the inference service from the command line:
 ```shell
-python3 -m lmdeploy.serve.client {server_ip_addresss}:33337 internlm
+python3 -m lmdeploy.serve.client {server_ip_addresss}:33337
 ```
 or through the WebUI:
 ```shell
-python3 -m lmdeploy.app {server_ip_addresss}:33337 internlm
+python3 -m lmdeploy.app {server_ip_addresss}:33337
 ```
 ![](https://github.com/InternLM/lmdeploy/assets/67539920/08d1e6f2-3767-44d5-8654-c85767cec2ab)
-For how to deploy other models, such as LLaMA and vicuna, please refer to the guide [here](docs/zh_cn/serving.md)
+For how to deploy other models, such as LLaMA, LLaMA-2, vicuna and so on, please refer to the guide [here](docs/zh_cn/serving.md)
 ### Inference with PyTorch
...
@@ -2,7 +2,7 @@
 We provide several profiling tools to benchmark our models.
-## profiling with dataset
+## profile with dataset
 Download the dataset below or create your own dataset.
@@ -16,7 +16,6 @@ Profiling your model with `profile_throughput.py`
 python profile_throughput.py \
  ShareGPT_V3_unfiltered_cleaned_split.json \
  /path/to/your/model \
- ${ModelType} \
  --concurrency 64
 ```
@@ -27,7 +26,6 @@ python profile_throughput.py \
 ```bash
 python profile_generation.py \
  /path/to/your/model \
- ${ModelType} \
  --concurrency 8 --input_seqlen 0 --output_seqlen 2048
 ```
@@ -36,10 +34,11 @@ python profile_generation.py \
 Tools above profile models with Python API. `profile_serving.py` is used to do benchmark on serving.
 ```bash
+wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
 python profile_serving.py \
  ${TritonServerAddress} \
- ${ModelName} \
  /path/to/tokenizer \
- /path/to/dataset \
+ ShareGPT_V3_unfiltered_cleaned_split.json \
  --concurrency 64
 ```
@@ -7,7 +7,6 @@ from threading import Thread
 import fire
 import numpy as np
-from lmdeploy.model import MODELS
 from lmdeploy.turbomind import Tokenizer, TurboMind
@@ -74,16 +73,13 @@ def warmup(model, concurrency: int, output_seqlen: int, warmup_round: int = 4):
 def main(model_path: str,
-         model_name: str,
          concurrency: int = 1,
          input_seqlen: int = 0,
          output_seqlen: int = 512,
          test_round: int = 10):
     tokenizer_model_path = osp.join(model_path, 'triton_models', 'tokenizer')
     tokenizer = Tokenizer(tokenizer_model_path)
-    model = MODELS.get(model_name)()
-    stop_words = model.stop_words
-    tm_model = TurboMind(model_path=model_path, stop_words=stop_words)
+    tm_model = TurboMind(model_path=model_path)
     warmup(tm_model, concurrency, output_seqlen)
@@ -127,7 +123,8 @@ def main(model_path: str,
     token_latency_min = np.min(stats[:, 2], axis=0)
     token_latency_max = np.max(stats[:, 2], axis=0)
     token_latency_ave = np.mean(stats[:, 2], axis=0)
-    throughput = np.sum(stats[:, 1], axis=0) / np.sum(stats[:, 2], axis=0)
+    throughput = np.sum(stats[:, 1], axis=0) / np.sum(stats[:, 2],
+                                                      axis=0) * concurrency
     print(f'\n{"-" * 50}\nconcurrency: {concurrency}, input_tokens: '
           f'{input_seqlen}, output_tokens: {output_seqlen}\n'
           f'elapsed_time: {elapsed_time:.2f}s\n'
@@ -136,7 +133,7 @@ def main(model_path: str,
           f'{first_token_latency_ave:.2f}s\ntoken latency(min, max, ave): '
           f'{token_latency_min:.2f}s, {token_latency_max:.2f}s, '
           f'{token_latency_ave:.2f}s\n'
-          f'throughput per threads: {throughput} token/s\n{"-" * 50}')
+          f'throughput: {throughput} token/s\n{"-" * 50}')
 if __name__ == '__main__':
...
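The commit message mentions a corrected throughput equation, and the hunk above adds the `* concurrency` factor. The sketch below is only an illustration of why that factor belongs there; the `stats` array and all numbers are made up, merely shaped like the per-session statistics collected in `profile_generation.py` above.

```python
import numpy as np

# Toy stand-in for the collected stats: one row per concurrent session,
# columns roughly [first_token_latency, generated_tokens, total_latency].
stats = np.array([[0.9, 512., 10.0],
                  [1.1, 512., 10.2]])
concurrency = stats.shape[0]

# Total tokens divided by total latency is an *average per-session* rate,
# which is what the old formula reported.
per_session_rate = np.sum(stats[:, 1]) / np.sum(stats[:, 2])

# The sessions run in parallel, so the aggregate throughput is roughly the
# per-session rate scaled by the number of concurrent sessions.
throughput = per_session_rate * concurrency
print(f'~{throughput:.1f} token/s across {concurrency} sessions')
```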
@@ -53,29 +53,25 @@ def infer(chatbot, session_id: int, req_que: mp.Queue, res_que: mp.Queue):
 def warmup(tritonserver_addr: str,
-           model_name: str,
            concurrency: int,
-           session_len: int,
            output_seqlen: int,
            warmup_round: int = 4):
     print('start to warmup ...')
     def _infer(_chatbot, session_id):
         for _ in range(warmup_round):
-            for _, _, _ in chatbot.stream_infer(
+            for _, _, _ in _chatbot.stream_infer(
                     session_id,
                     prompt='',
                     request_output_len=output_seqlen,
                     sequence_start=True,
                     sequence_end=True):
                 continue
-            chatbot.reset_session()
+            _chatbot.reset_session()
     _start = time.perf_counter()
     chatbots = [
         Chatbot(tritonserver_addr=tritonserver_addr,
-                model_name=model_name,
-                session_len=session_len,
                 ignore_eos=True,
                 profile_generation=True) for _ in range(concurrency)
     ]
@@ -90,8 +86,8 @@ def warmup(tritonserver_addr: str,
     print(f'end warmup, elapsed time: {round(_end - _start, 2)} s')
-def read_dataset(tritonserver_addr, tokenizer_path: str, dataset_path: str,
-                 samples: int, test_round: int, session_len: int):
+def read_dataset(tokenizer_path: str, dataset_path: str, samples: int,
+                 test_round: int, session_len: int):
     start = time.perf_counter()
     with open(dataset_path) as f:
         dataset = json.load(f)
@@ -134,24 +130,20 @@ def read_dataset(tritonserver_addr, tokenizer_path: str, dataset_path: str,
 def main(tritonserver_addr: str,
-         model_name: str,
          tokenizer_path: str,
          dataset_path: str,
          concurrency: int = 1,
         session_len: int = 2048,
-         samples: int = 2000,
+         samples: int = 1000,
         test_round: int = 1):
-    warmup(tritonserver_addr, model_name, concurrency, session_len,
-           session_len)
-    req_que = read_dataset(tritonserver_addr, tokenizer_path, dataset_path,
-                           samples, test_round, session_len)
+    warmup(tritonserver_addr, concurrency, session_len - 1)
+    req_que = read_dataset(tokenizer_path, dataset_path, samples, test_round,
+                           session_len)
     res_que = mp.Queue()
     procs = []
     _start = time.perf_counter()
     for i in range(concurrency):
         chatbot = Chatbot(tritonserver_addr=tritonserver_addr,
-                          model_name=model_name,
-                          session_len=session_len,
                           display=False,
                           profile_serving=True,
                           ignore_eos=True)
...
@@ -8,7 +8,6 @@ from typing import List, Tuple
 import fire
-from lmdeploy.model import MODELS
 from lmdeploy.turbomind import Tokenizer, TurboMind
@@ -55,13 +54,11 @@ def sample_requests(
 class Engine:
-    def __init__(self, model_path: str, model_name: str):
+    def __init__(self, model_path: str):
         tokenizer_model_path = osp.join(model_path, 'triton_models',
                                         'tokenizer')
         tokenizer = Tokenizer(tokenizer_model_path)
-        model = MODELS.get(model_name)()
-        stop_words = model.stop_words
-        tm_model = TurboMind(model_path=model_path, stop_words=stop_words)
+        tm_model = TurboMind(model_path=model_path)
         self.tm_model = tm_model
         self.tokenizer = tokenizer
@@ -119,11 +116,10 @@ class Engine:
 def main(dataset: str,
          model_path: str,
-         model_name: str,
          concurrency: int = 1,
         num_prompts: int = 1000):
-    engine = Engine(model_path, model_name)
+    engine = Engine(model_path)
     tokenizer = engine.tokenizer
     requests = sample_requests(dataset, num_prompts, tokenizer)
...
 # Serving a model
+## Serving [LLaMA-2](https://github.com/facebookresearch/llama)
+You can download [llama-2 models from huggingface](https://huggingface.co/meta-llama) and serve them like below:
+<details open>
+<summary><b>7B</b></summary>
+```shell
+python3 -m lmdeploy.serve.turbomind.deploy llama2 /path/to/llama-2-7b-chat-hf
+bash workspace/service_docker_up.sh
+```
+</details>
+<details open>
+<summary><b>13B</b></summary>
+```shell
+python3 -m lmdeploy.serve.turbomind.deploy llama2 /path/to/llama-2-13b-chat-hf --tp 2
+bash workspace/service_docker_up.sh
+```
+</details>
+<details open>
+<summary><b>70B</b></summary>
+```shell
+python3 -m lmdeploy.serve.turbomind.deploy llama2 /path/to/llama-2-70b-chat-hf --tp 8
+bash workspace/service_docker_up.sh
+```
+</details>
 ## Serving [LLaMA](https://github.com/facebookresearch/llama)
 Weights for the LLaMA models can be obtained from by filling out [this form](https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform)
@@ -8,7 +42,7 @@ Weights for the LLaMA models can be obtained from by filling out [this form](htt
 <summary><b>7B</b></summary>
 ```shell
-python3 -m lmdeploy.serve.turbomind.deploy llama-7B /path/to/llama-7b llama \
+python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-7b llama \
   --tokenizer_path /path/to/tokenizer/model
 bash workspace/service_docker_up.sh
 ```
@@ -19,7 +53,7 @@ bash workspace/service_docker_up.sh
 <summary><b>13B</b></summary>
 ```shell
-python3 -m lmdeploy.serve.turbomind.deploy llama-13B /path/to/llama-13b llama \
+python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-13b llama \
   --tokenizer_path /path/to/tokenizer/model --tp 2
 bash workspace/service_docker_up.sh
 ```
@@ -30,7 +64,7 @@ bash workspace/service_docker_up.sh
 <summary><b>30B</b></summary>
 ```shell
-python3 -m lmdeploy.serve.turbomind.deploy llama-32B /path/to/llama-30b llama \
+python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-30b llama \
   --tokenizer_path /path/to/tokenizer/model --tp 4
 bash workspace/service_docker_up.sh
 ```
@@ -41,7 +75,7 @@ bash workspace/service_docker_up.sh
 <summary><b>65B</b></summary>
 ```shell
-python3 -m lmdeploy.serve.turbomind.deploy llama-65B /path/to/llama-65b llama \
+python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-65b llama \
   --tokenizer_path /path/to/tokenizer/model --tp 8
 bash workspace/service_docker_up.sh
 ```
@@ -60,7 +94,7 @@ python3 -m fastchat.model.apply_delta \
   --target-model-path /path/to/vicuna-7b \
   --delta-path lmsys/vicuna-7b-delta-v1.1
-python3 -m lmdeploy.serve.turbomind.deploy vicuna-7B /path/to/vicuna-7b hf
+python3 -m lmdeploy.serve.turbomind.deploy vicuna /path/to/vicuna-7b
 bash workspace/service_docker_up.sh
 ```
@@ -76,7 +110,7 @@ python3 -m fastchat.model.apply_delta \
   --target-model-path /path/to/vicuna-13b \
   --delta-path lmsys/vicuna-13b-delta-v1.1
-python3 -m lmdeploy.serve.turbomind.deploy vicuna-13B /path/to/vicuna-13b hf
+python3 -m lmdeploy.serve.turbomind.deploy vicuna /path/to/vicuna-13b
 bash workspace/service_docker_up.sh
 ```
...
 # Serving a model
+## Serving [LLaMA-2](https://github.com/facebookresearch/llama)
+Download the llama-2 models from [here](https://huggingface.co/meta-llama) and deploy the service with the commands below:
+<details open>
+<summary><b>7B</b></summary>
+```shell
+python3 -m lmdeploy.serve.turbomind.deploy llama2 /path/to/llama-2-7b-chat-hf
+bash workspace/service_docker_up.sh
+```
+</details>
+<details open>
+<summary><b>13B</b></summary>
+```shell
+python3 -m lmdeploy.serve.turbomind.deploy llama2 /path/to/llama-2-13b-chat-hf --tp 2
+bash workspace/service_docker_up.sh
+```
+</details>
+<details open>
+<summary><b>70B</b></summary>
+```shell
+python3 -m lmdeploy.serve.turbomind.deploy llama2 /path/to/llama-2-70b-chat-hf --tp 8
+bash workspace/service_docker_up.sh
+```
+</details>
 ## Serving [LLaMA](https://github.com/facebookresearch/llama)
 Fill out [this form](https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform) to obtain the LLaMA model weights
@@ -8,7 +42,7 @@
 <summary><b>7B</b></summary>
 ```shell
-python3 -m lmdeploy.serve.turbomind.deploy llama-7B /path/to/llama-7b llama \
+python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-7b llama \
   --tokenizer_path /path/to/tokenizer/model
 bash workspace/service_docker_up.sh
 ```
@@ -19,7 +53,7 @@ bash workspace/service_docker_up.sh
 <summary><b>13B</b></summary>
 ```shell
-python3 -m lmdeploy.serve.turbomind.deploy llama-13B /path/to/llama-13b llama \
+python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-13b llama \
   --tokenizer_path /path/to/tokenizer/model --tp 2
 bash workspace/service_docker_up.sh
 ```
@@ -30,7 +64,7 @@ bash workspace/service_docker_up.sh
 <summary><b>30B</b></summary>
 ```shell
-python3 -m lmdeploy.serve.turbomind.deploy llama-32B /path/to/llama-30b llama \
+python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-30b llama \
   --tokenizer_path /path/to/tokenizer/model --tp 4
 bash workspace/service_docker_up.sh
 ```
@@ -41,7 +75,7 @@ bash workspace/service_docker_up.sh
 <summary><b>65B</b></summary>
 ```shell
-python3 -m lmdeploy.serve.turbomind.deploy llama-65B /path/to/llama-65b llama \
+python3 -m lmdeploy.serve.turbomind.deploy llama /path/to/llama-65b llama \
   --tokenizer_path /path/to/tokenizer/model --tp 8
 bash workspace/service_docker_up.sh
 ```
@@ -60,7 +94,7 @@ python3 -m fastchat.model.apply_delta \
   --target-model-path /path/to/vicuna-7b \
   --delta-path lmsys/vicuna-7b-delta-v1.1
-python3 -m lmdeploy.serve.turbomind.deploy vicuna-7B /path/to/vicuna-7b hf
+python3 -m lmdeploy.serve.turbomind.deploy vicuna /path/to/vicuna-7b
 bash workspace/service_docker_up.sh
 ```
@@ -76,7 +110,7 @@ python3 -m fastchat.model.apply_delta \
   --target-model-path /path/to/vicuna-13b \
   --delta-path lmsys/vicuna-13b-delta-v1.1
-python3 -m lmdeploy.serve.turbomind.deploy vicuna-13B /path/to/vicuna-13b hf
+python3 -m lmdeploy.serve.turbomind.deploy vicuna /path/to/vicuna-13b
 bash workspace/service_docker_up.sh
 ```
...
@@ -101,28 +101,26 @@ def cancel_func(
 def run(triton_server_addr: str,
-        model_name: str,
         server_name: str = 'localhost',
         server_port: int = 6006):
     """chat with AI assistant through web ui.
     Args:
         triton_server_addr (str): the communication address of inference server
-        model_name (str): the name of the deployed model
         server_name (str): the ip address of gradio server
         server_port (int): the port of gradio server
     """
     with gr.Blocks(css=CSS, theme=THEME) as demo:
+        log_level = os.environ.get('SERVICE_LOG_LEVEL', 'INFO')
+        _chatbot = Chatbot(triton_server_addr,
+                           log_level=log_level,
+                           display=True)
+        model_name = _chatbot.model_name
         chat_interface = partial(chat_stream, model_name=model_name)
         reset_all = partial(reset_all_func,
                             model_name=model_name,
                             triton_server_addr=triton_server_addr)
-        log_level = os.environ.get('SERVICE_LOG_LEVEL', 'INFO')
-        llama_chatbot = gr.State(
-            Chatbot(triton_server_addr,
-                    model_name,
-                    log_level=log_level,
-                    display=True))
+        llama_chatbot = gr.State(_chatbot)
         state_chatbot = gr.State([])
         with gr.Column(elem_id='container'):
...
@@ -4,11 +4,43 @@ from mmengine import Registry
 MODELS = Registry('model', locations=['lmdeploy.model'])
+@MODELS.register_module(name='llama')
+class BaseModel:
+    """Base model."""
+    def __init__(self):
+        self.session_len = 2048
+        self.top_p = 0.8
+        self.top_k = None
+        self.temperature = 0.8
+        self.repetition_penalty = 1.0
+    @staticmethod
+    def get_prompt(prompt, sequence_start=True):
+        """Return the prompt that is concatenated with other elements in the
+        chat template.
+        Args:
+            prompt (str): user's input prompt
+            sequence_start (bool): indicator for the first round chat of a
+                session sequence
+        Returns:
+            str: the concatenated prompt
+        """
+        return prompt
+    @property
+    def stop_words(self):
+        """Return the stop-words' token ids."""
+        return None
 @MODELS.register_module(name='vicuna')
-class Vicuna:
+class Vicuna(BaseModel):
     """Chat template of vicuna model."""
     def __init__(self):
+        super().__init__()
         self.system = """A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. """ # noqa: E501
         self.user = 'USER'
         self.assistant = 'ASSISTANT'
@@ -29,17 +61,20 @@ class Vicuna:
         else:
             return f'</s>{self.user}: {prompt} {self.assistant}:'
-    @property
-    def stop_words(self):
-        """Return the stop-words' token ids."""
-        return None
 @MODELS.register_module(name='internlm')
-class InternLM:
+class InternLM(BaseModel):
+    def __init__(self):
+        super().__init__()
+@MODELS.register_module(name='internlm-chat-7b')
+class InternLMChat7B(BaseModel):
     """Chat template of InternLM model."""
     def __init__(self):
+        super().__init__()
         self.system = ''
         self.user = '<|User|>'
         self.eoh = '<eoh>'
@@ -70,38 +105,21 @@ class InternLM:
         return [103027, 103028]
-@MODELS.register_module(name='llama')
-class Llama:
-    """Chat template of LLaMA model."""
+@MODELS.register_module(name='internlm-chat-7b-8k')
+class InternLMChat7B8K(InternLMChat7B):
     def __init__(self):
-        pass
+        super(InternLMChat7B8K, self).__init__()
+        self.session_len = 8192
-    def get_prompt(self, prompt, sequence_start=True):
-        """Return the prompt that is concatenated with other elements in the
-        chat template.
-        Args:
-            prompt (str): user's input prompt
-            sequence_start (bool): indicator for the first round chat of a
-                session sequence
-        Returns:
-            str: the concatenated prompt
-        """
-        return prompt
-    @property
-    def stop_words(self):
-        """Return the stop-words' token ids."""
-        return None
 @MODELS.register_module(name='puyu')
-class Puyu:
+class Puyu(BaseModel):
     """Chat template of puyu model.This is only for internal usage in Shanghai
     AI Laboratory."""
     def __init__(self):
+        super().__init__()
         self.system = """meta instruction
 You are an AI assistant whose name is InternLM (书生·浦语).
 - 书生·浦语 is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.
@@ -125,12 +143,20 @@ conversation""" # noqa: E501
         return [45623]
+@MODELS.register_module(name='baichuan-7b')
+class Baichuan7B(BaseModel):
+    def __init__(self):
+        super().__init__()
+        self.repetition_penalty = 1.1
 @MODELS.register_module(name='llama2')
-class Llama2:
+class Llama2(BaseModel):
     """Chat template of LLaMA2 model."""
     def __init__(self):
+        super().__init__()
         B_INST, E_INST = '[INST]', '[/INST]'
         B_SYS, E_SYS = '<<SYS>>\n', '\n<</SYS>>\n\n'
@@ -144,6 +170,7 @@ If a question does not make any sense, or is not factually coherent, explain why
         self.b_sys = B_SYS
         self.e_sys = E_SYS
         self.default_sys_prompt = DEFAULT_SYSTEM_PROMPT
+        self.session_len = 4096
     def get_prompt(self, prompt, sequence_start=True):
         """Return the prompt that is concatenated with other elements in the
@@ -163,19 +190,15 @@ If a question does not make any sense, or is not factually coherent, explain why
         return f'{self.b_inst} {prompt} {self.e_inst} '
-    @property
-    def stop_words(self):
-        """Return the stop-words' token ids."""
-        return None
 def main(model_name: str = 'test'):
     assert model_name in MODELS.module_dict.keys(), \
         f"'{model_name}' is not supported. " \
         f'The supported models are: {MODELS.module_dict.keys()}'
-    model = MODELS.get('vicuna--1')()
+    model = MODELS.get(model_name)()
     prompt = model.get_prompt(prompt='hi')
     print(prompt)
+    print(f'session_len: {model.session_len}')
 if __name__ == '__main__':
...
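Taken together, the hunks above are the factory pattern named in the commit title: each chat template registers itself in the `MODELS` registry under a model name and inherits its defaults (`session_len`, sampling parameters, `stop_words`) from `BaseModel`, so callers only need the name. A minimal sketch of how a caller would use the registry, assuming `lmdeploy.model` is importable; the printed values in the comments are illustrative, taken from the defaults shown above.

```python
from lmdeploy.model import MODELS

# Look the chat template up by its registered name and instantiate it.
model = MODELS.get('internlm-chat-7b')()

print(model.session_len)   # 2048, the BaseModel default
print(model.stop_words)    # e.g. [103027, 103028] for the InternLM chat template
print(model.get_prompt('hi', sequence_start=True))

# The 8k variant only overrides the context length.
print(MODELS.get('internlm-chat-7b-8k')().session_len)   # 8192
```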
@@ -13,7 +13,7 @@ def input_prompt():
     return '\n'.join(iter(input, sentinel))
-def main(tritonserver_addr: str, model_name: str, session_id: int = 1):
+def main(tritonserver_addr: str, session_id: int = 1):
     """An example to communicate with inference server through the command line
     interface.
@@ -24,10 +24,7 @@ def main(tritonserver_addr: str, model_name: str, session_id: int = 1):
         session_id (int): the identical id of a session
     """
     log_level = os.environ.get('SERVICE_LOG_LEVEL', 'WARNING')
-    chatbot = Chatbot(tritonserver_addr,
-                      model_name,
-                      log_level=log_level,
-                      display=True)
+    chatbot = Chatbot(tritonserver_addr, log_level=log_level, display=True)
     nth_round = 1
     while True:
         prompt = input_prompt()
...
@@ -64,15 +64,6 @@ class Chatbot:
         tritonserver_addr (str): communicating address '<ip>:<port>' of
             triton inference server
         model_name (str): name of the to-be-deployed mode
-        session_len (int): the maximum context length of the model
-        top_p (float): If set to float < 1, only the smallest set of most
-            probable tokens with probabilities that add up to top_p or higher
-            are kept for generation.
-        top_k (int): The number of the highest probability vocabulary tokens to
-            keep for top-k-filtering
-        temperature (float): to modulate the next token probability
-        repetition_penalty (float): The parameter for repetition penalty.
-            1.0 means no penalty
         log_level (int): the level of the log
         display (bool): display the generated text on consolo or not
         profile_generation (bool): profile token generation or not
@@ -80,24 +71,18 @@ class Chatbot:
     def __init__(self,
                  tritonserver_addr: str,
-                 model_name: str,
-                 session_len: int = 2048,
-                 top_p: float = 0.8,
-                 top_k: int = None,
-                 temperature: float = 0.8,
-                 repetition_penalty: float = 1.0,
                  ignore_eos: bool = False,
                  log_level: int = logging.INFO,
                  display: bool = False,
                  profile_generation: bool = False,
                  profile_serving: bool = False):
-        assert model_name in MODELS.module_dict.keys(), \
-            f"'{model_name}' is not supported. " \
+        self.tritonserver_addr = tritonserver_addr
+        self.model_name = self._get_model_name()
+        assert self.model_name in MODELS.module_dict.keys(), \
+            f"'{self.model_name}' is not supported. " \
             f'The supported models are: {MODELS.module_dict.keys()}'
-        self.model_name = model_name
         self.model = MODELS.get(self.model_name)()
         self._session = None
-        self.tritonserver_addr = tritonserver_addr
         self.preprocess = Preprocessor(tritonserver_addr)
         self.postprocess = Postprocessor(tritonserver_addr)
         self.bos_id = self._get_bos()
@@ -108,11 +93,11 @@ class Chatbot:
         stop_words = None
         bad_words = np.array([[[self.eos_id], [1]]], dtype=np.int32)
         self.cfg = mmengine.Config(
-            dict(session_len=session_len,
-                 top_p=top_p,
-                 top_k=top_k,
-                 temperature=temperature,
-                 repetition_penalty=repetition_penalty,
+            dict(session_len=self.model.session_len,
+                 top_p=self.model.top_p,
+                 top_k=self.model.top_k,
+                 temperature=self.model.temperature,
+                 repetition_penalty=self.model.repetition_penalty,
                  stop_words=stop_words,
                  bad_words=bad_words))
         self.log_level = log_level
@@ -167,12 +152,16 @@ class Chatbot:
                 request_output_len,
                 sequence_start,
                 sequence_end):
+            yield status, res, tokens
             if status.value < 0:
-                return
-            else:
-                yield status, res, tokens
+                break
+        if status.value == 0:
             self._session.histories = \
                 self._session.histories + self._session.prompt + \
                 self._session.response
+        else:
+            yield status, res, tokens
@@ -208,11 +197,11 @@ class Chatbot:
                 request_output_len=0,
                 sequence_start=False,
                 sequence_end=True):
-            if status != StatusCode.TRITON_STREAM_END:
-                return status
+            if status.value < 0:
+                break
         self.reset_session()
-        return StatusCode.TRITON_STREAM_END
+        return status
     def cancel(self, session_id: int, *args, **kwargs):
         """Cancel the session during generating tokens.
@@ -243,6 +232,7 @@ class Chatbot:
             return StatusCode.TRITON_SESSION_CLOSED
         prev_session = self._session
+        status, res = None, None
         for status, res, _ in self._stream_infer(self._session,
                                                  prompt='',
                                                  request_output_len=0,
@@ -254,7 +244,7 @@ class Chatbot:
         if status == StatusCode.TRITON_STREAM_END:
             logger.info(f'cancel session {session_id} successfully')
             if prev_session.histories:
-                logger.warn(f'TODO: start to recover session {session_id}')
+                logger.warning(f'TODO: start to recover session {session_id}')
         else:
             logger.info(f'cancel session {session_id} failed: {res}')
         return status
@@ -295,7 +285,7 @@ class Chatbot:
                 sequence_start=True,
                 sequence_end=False):
             if status.value < 0:
-                return status
+                break
         self._session.histories = histories
         return status
@@ -314,6 +304,14 @@ class Chatbot:
         """set session."""
         self._session = value
+    def _get_model_name(self):
+        with grpcclient.InferenceServerClient(
+                self.tritonserver_addr) as client:
+            model_config = client.get_model_config(model_name='turbomind',
+                                                   as_json=True)
+        return model_config['config']['parameters']['model_name'][
+            'string_value']
     def _get_bos(self):
         """return bos token id."""
         token_ids, _ = self.preprocess('<BOS>')
@@ -422,16 +420,12 @@ class Chatbot:
                    request_output_len, sequence_start,
                    sequence_end, preseq_length, cancel))
         producer.start()
-        for state, res, tokens in self.stream_consumer(self.postprocess, que,
-                                                       session, input_tokens,
-                                                       preseq_length, cancel,
-                                                       logger, self.display,
-                                                       self.profile_generation,
+        for status, res, n_token in self.stream_consumer(
+                self.postprocess, que, session, input_tokens, preseq_length,
+                cancel, logger, self.display, self.profile_generation,
                 self.eos_id):
-            if state.value < 0:
-                yield state, res, 0
-            else:
-                yield state, res, tokens
+            yield status, res, n_token
         producer.join()
         self._session = que.get()
         curseq_length = self._session.sequence_length
@@ -543,11 +537,13 @@ class Chatbot:
             tuple: status, text, generated token number
         """
         offset = n_input_token + preseq_length
+        status, res, n_token = None, '', 0
         while True:
             result = res_queue.get()
             if result is None:
-                yield (StatusCode.TRITON_STREAM_END, session.response,
-                       session.sequence_length - offset)
+                status = StatusCode.TRITON_STREAM_END
+                res = session.response
+                n_token = session.sequence_length - offset
                 session.status = StatusCode.TRITON_STREAM_END
                 break
            if 'errcode' in result:
@@ -555,7 +551,10 @@ class Chatbot:
                     f"{result['errcode']}, {result['errmsg']}, "
                     f'token {session.sequence_length}')
                 session.sequence_length = preseq_length
-                yield result['errcode'], result['errmsg'], 0
+                session.response = ''
+                status = StatusCode.TRITON_SERVER_ERR
+                res = f"{result['errcode']}, {result['errmsg']}"
+                n_token = 0
                 break
             if cancel:
                 continue
@@ -601,3 +600,4 @@ class Chatbot:
         res_queue.put(session)
         if display:
             print('\n')
+        yield status, res, n_token
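The new `_get_model_name` helper is what lets every front end above (`app.py`, `serve.client`, the profilers) drop their `model_name` arguments: the name is fetched from the Triton server's model configuration instead of being typed by the user. A rough, self-contained sketch of that lookup; the server address is a placeholder and the printed name is illustrative, while the client calls mirror the hunk above.

```python
import tritonclient.grpc as grpcclient

tritonserver_addr = '0.0.0.0:33337'  # placeholder '<ip>:<port>'

# Ask triton inference server for the 'turbomind' model configuration and
# read back the model_name parameter that deploy.py writes into config.pbtxt
# (see the deploy.py hunks further down).
with grpcclient.InferenceServerClient(tritonserver_addr) as client:
    model_config = client.get_model_config(model_name='turbomind',
                                           as_json=True)
model_name = model_config['config']['parameters']['model_name']['string_value']
print(model_name)  # e.g. 'internlm-chat-7b'
```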
@@ -12,9 +12,16 @@ import safetensors
 import torch
 from sentencepiece import SentencePieceProcessor
+from lmdeploy.model import MODELS
 supported_formats = ['llama', 'hf']
+def get_package_root_path():
+    import importlib.resources as pkg_resources
+    return pkg_resources.path('lmdeploy', '')
 def create_workspace(_path: str):
     """Create a workspace.
@@ -164,6 +171,7 @@ def export(model_name: str,
         save_bin(param_data, param_name)
     # export config and save it to {out_dir}/config.ini
+    model = MODELS.get(model_name)()
     vocab_size, bos_id, eos_id = tokenizer_info(tokenizer_path)
     assert _vocab_size >= vocab_size, \
         f'different vocab size {_vocab_size} vs {vocab_size}'
@@ -184,7 +192,7 @@ def export(model_name: str,
         # parameters for turbomind
         max_batch_size=32,
         max_context_token_num=4,
-        session_len=2056,
+        session_len=model.session_len + 8,
         step_length=1,
         cache_max_entry_count=48,
         cache_chunk_size=1,
@@ -226,6 +234,9 @@ def deploy_llama(model_name: str, model_path: str, tokenizer_path: str,
     if osp.exists(tokenizer_path):
         shutil.copy(tokenizer_path,
                     osp.join(triton_models_path, 'tokenizer/tokenizer.model'))
+        with get_package_root_path() as root_path:
+            shutil.copy(osp.join(root_path, 'turbomind/tokenizer.py'),
+                        osp.join(triton_models_path, 'tokenizer'))
     else:
         print(f'tokenizer model {tokenizer_path} does not exist')
         return False
@@ -352,6 +363,9 @@ def deploy_hf(model_name: str, model_path: str, tokenizer_path: str,
             json_path = osp.join(model_path, _file)
             shutil.copy(json_path,
                         osp.join(triton_models_path, 'tokenizer', _file))
+        with get_package_root_path() as root_path:
+            shutil.copy(osp.join(root_path, 'turbomind/tokenizer.py'),
+                        osp.join(triton_models_path, 'tokenizer'))
     else:
         print(f'tokenizer model {tokenizer_path} does not exist')
         exit(-1)
@@ -495,7 +509,7 @@ def pack_model_repository(workspace_path: str):
 def main(model_name: str,
          model_path: str,
-         model_format: str,
+         model_format: str = 'hf',
          tokenizer_path: str = None,
          dst_path: str = './workspace',
          tp: int = 1):
@@ -511,6 +525,9 @@ def main(model_name: str,
         dst_path (str): the destination path that saves outputs
         tp (int): the number of GPUs used for tensor parallelism
     """
+    assert model_name in MODELS.module_dict.keys(), \
+        f"'{model_name}' is not supported. " \
+        f'The supported models are: {MODELS.module_dict.keys()}'
     if model_format not in supported_formats:
         print(f'the model format "{model_format}" is not supported. '
@@ -539,8 +556,11 @@ def main(model_name: str,
     # update `tensor_para_size` in `triton_models/interactive/config.pbtxt`
     with open(osp.join(triton_models_path, 'interactive/config.pbtxt'),
               'a') as f:
-        param = 'parameters {\n key: "tensor_para_size"\n value: {\n ' \
-                'string_value: ' + f'"{tp}"\n' + ' }\n}\n'
+        param = \
+            'parameters {\n key: "tensor_para_size"\n value: {\n ' \
+            'string_value: ' + f'"{tp}"\n' + ' }\n}\n' + \
+            'parameters {\n key: "model_name"\n value: {\n ' \
+            'string_value: ' + f'"{model_name}"\n' + ' }\n}\n'
         f.write(param)
     if not res:
         print(f'deploy model "{model_name}" via turbomind failed')
...
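For readers unfamiliar with the pbtxt syntax, the snippet below simply renders the string that the hunk above appends to `triton_models/interactive/config.pbtxt`. The values (`tp = 2`, `'internlm-chat-7b'`) are illustrative and the exact whitespace is approximate; only the two parameter keys come from the diff.

```python
tp, model_name = 2, 'internlm-chat-7b'  # illustrative values

param = \
    'parameters {\n key: "tensor_para_size"\n value: {\n ' \
    'string_value: ' + f'"{tp}"\n' + ' }\n}\n' + \
    'parameters {\n key: "model_name"\n value: {\n ' \
    'string_value: ' + f'"{model_name}"\n' + ' }\n}\n'
print(param)
# parameters {
#  key: "tensor_para_size"
#  value: {
#  string_value: "2"
#  }
# }
# parameters {
#  key: "model_name"
#  value: {
#  string_value: "internlm-chat-7b"
#  }
# }
```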
@@ -2,88 +2,15 @@
 import json
 import os.path as osp
 from pathlib import Path
-from typing import List
 import numpy as np
 import triton_python_backend_utils as pb_utils
-class Tokenizer:
-    """Tokenize prompts or de-tokenize tokens into texts.
-    Args:
-        model_file (str): the path of the tokenizer model
-    """
-    def __init__(self, model_file: str):
-        model_folder = osp.split(model_file)[0]
-        tokenizer_config_file = osp.join(model_folder, 'tokenizer_config.json')
-        use_hf_model = osp.exists(tokenizer_config_file)
-        self.use_hf_model = use_hf_model
-        if not self.use_hf_model:
-            from sentencepiece import SentencePieceProcessor
-            self.model = SentencePieceProcessor(model_file=model_file)
-            self.vocab_size = self.model.vocab_size()
-            self.start_id = self.model.bos_id()
-            self.end_id = self.model.eos_id()
-        else:
-            from transformers import AutoTokenizer
-            backend_tokenizer_file = osp.join(model_folder, 'tokenizer.json')
-            if not osp.exists(backend_tokenizer_file):
-                print('WARNING: Can not find tokenizer.json. '
-                      'It may take long time to initialize the tokenizer.')
-            self.model = AutoTokenizer.from_pretrained(model_folder,
-                                                       trust_remote_code=True)
-            self.vocab_size = self.model.vocab_size
-            self.start_id = self.model.bos_token_id
-            self.end_id = self.model.eos_token_id
-            # save tokenizer.json to reuse
-            if not osp.exists(backend_tokenizer_file) and \
-                    hasattr(self.model, 'backend_tokenizer'):
-                self.model.backend_tokenizer.save(backend_tokenizer_file)
-    def encode(self, s: str):
-        """Tokenize a prompt.
-        Args:
-            s (str): a prompt
-        Returns:
-            list[int]: token ids
-        """
-        if not self.use_hf_model:
-            add_bos = False
-            add_eos = False
-            if s.find('<BOS>') != -1:
-                s = s.replace('<BOS>', '')
-                add_bos = True
-            if s == '<EOS>':
-                s = ''
-                add_eos = True
-            return self.model.Encode(s, add_bos=add_bos, add_eos=add_eos)
-        else:
-            add_special_tokens = False
-            if s.find('<BOS>') != -1:
-                s = s.replace('<BOS>', '<s>')
-            if s == '<EOS>':
-                s = '</s>'
-            if len(s) == 0:
-                add_special_tokens = True
-            return self.model.encode(s, add_special_tokens=add_special_tokens)
-    def decode(self, t: List[int]):
-        """De-tokenize.
-        Args:
-            t (List[int]): a list of token ids
-        Returns:
-            str: text of decoding tokens
-        """
-        if not self.use_hf_model:
-            return self.model.Decode(t)
-        else:
-            skip_special_tokens = False
-            return self.model.decode(t,
-                                     skip_special_tokens=skip_special_tokens)
+# This tokenizer is `lmdeploy/turbomind/tokenizer.py`. When an LLM is served
+# by triton inference server, it has to be converted first by running
+# `python lmdeploy/serve/turbomind/deploy.py`. Then
+# `lmdeploy/turbomind/tokenizer.py` will be copied to `tokenizer/tokenizer.py`
+from .tokenizer.tokenizer import Tokenizer
 class TritonPythonModel:
...
@@ -2,90 +2,17 @@
 import json
 import os.path as osp
 from pathlib import Path
-from typing import List
 import numpy as np
 import torch
 import triton_python_backend_utils as pb_utils
 from torch.nn.utils.rnn import pad_sequence
-class Tokenizer:
-    """Tokenize prompts or de-tokenize tokens into texts.
-    Args:
-        model_file (str): the path of the tokenizer model
-    """
-    def __init__(self, model_file: str):
-        model_folder = osp.split(model_file)[0]
-        tokenizer_config_file = osp.join(model_folder, 'tokenizer_config.json')
-        use_hf_model = osp.exists(tokenizer_config_file)
-        self.use_hf_model = use_hf_model
-        if not self.use_hf_model:
-            from sentencepiece import SentencePieceProcessor
-            self.model = SentencePieceProcessor(model_file=model_file)
-            self.vocab_size = self.model.vocab_size()
-            self.start_id = self.model.bos_id()
-            self.end_id = self.model.eos_id()
-        else:
-            from transformers import AutoTokenizer
-            backend_tokenizer_file = osp.join(model_folder, 'tokenizer.json')
-            if not osp.exists(backend_tokenizer_file):
-                print('WARNING: Can not find tokenizer.json. '
-                      'It may take long time to initialize the tokenizer.')
-            self.model = AutoTokenizer.from_pretrained(model_folder,
-                                                       trust_remote_code=True)
-            self.vocab_size = self.model.vocab_size
-            self.start_id = self.model.bos_token_id
-            self.end_id = self.model.eos_token_id
-            # save tokenizer.json to reuse
-            if not osp.exists(backend_tokenizer_file) and \
-                    hasattr(self.model, 'backend_tokenizer'):
-                self.model.backend_tokenizer.save(backend_tokenizer_file)
-    def encode(self, s: str):
-        """Tokenize a prompt.
-        Args:
-            s (str): a prompt
-        Returns:
-            list[int]: token ids
-        """
-        if not self.use_hf_model:
-            add_bos = False
-            add_eos = False
-            if s.find('<BOS>') != -1:
-                s = s.replace('<BOS>', '')
-                add_bos = True
-            if s == '<EOS>':
-                s = ''
-                add_eos = True
-            return self.model.Encode(s, add_bos=add_bos, add_eos=add_eos)
-        else:
-            add_special_tokens = False
-            if s.find('<BOS>') != -1:
-                s = s.replace('<BOS>', '<s>')
-            if s == '<EOS>':
-                s = '</s>'
-            if len(s) == 0:
-                add_special_tokens = True
-            return self.model.encode(s, add_special_tokens=add_special_tokens)
-    def decode(self, t: List[int]):
-        """De-tokenize.
-        Args:
-            t (List[int]): a list of token ids
-        Returns:
-            str: text of decoding tokens
-        """
-        if not self.use_hf_model:
-            return self.model.Decode(t)
-        else:
-            skip_special_tokens = False
-            return self.model.decode(t,
-                                     skip_special_tokens=skip_special_tokens)
+# This tokenizer is `lmdeploy/turbomind/tokenizer.py`. When an LLM is served
+# by triton inference server, it has to be converted first by running
+# `python lmdeploy/serve/turbomind/deploy.py`. Then
+# `lmdeploy/turbomind/tokenizer.py` will be copied to `tokenizer/tokenizer.py`
+from .tokenizer.tokenizer import Tokenizer
 class TritonPythonModel:
@@ -131,8 +58,8 @@ class TritonPythonModel:
             osp.join(
                 cur_folder, self.model_config['parameters']['tokenizer_path']
                 ['string_value']))
-        self.start_id = self.tokenizer.start_id
-        self.end_id = self.tokenizer.end_id
+        self.start_id = self.tokenizer.bos_token_id
+        self.end_id = self.tokenizer.eos_token_id
     def execute(self, requests):
         """`execute` must be implemented in every Python model. `execute`
...
@@ -29,29 +29,24 @@ def valid_str(string, coding='utf-8'):
     return ret
-def main(model_name,
-         model_path,
-         session_id: int = 1,
-         repetition_penalty: float = 1.0):
+def main(model_path, session_id: int = 1, repetition_penalty: float = 1.0):
     """An example to perform model inference through the command line
     interface.
     Args:
-        model_name (str): the name of the deployed model
         model_path (str): the path of the deployed model
         session_id (int): the identical id of a session
     """
-    model = MODELS.get(model_name)()
     tokenizer_model_path = osp.join(model_path, 'triton_models', 'tokenizer')
     tokenizer = Tokenizer(tokenizer_model_path)
-    tm_model = tm.TurboMind(model_path,
-                            eos_id=tokenizer.eos_token_id,
-                            stop_words=model.stop_words)
+    tm_model = tm.TurboMind(model_path, eos_id=tokenizer.eos_token_id)
     generator = tm_model.create_instance()
     nth_round = 1
     step = 0
     seed = random.getrandbits(64)
+    model_name = tm_model.model_name
+    model = MODELS.get(model_name)()
     while True:
         prompt = input_prompt()
...
@@ -12,6 +12,7 @@ import torch
 from torch.nn.utils.rnn import pad_sequence
 import lmdeploy
+from lmdeploy.model import MODELS
 # TODO: find another way import _turbomind
 lmdeploy_dir = osp.split(lmdeploy.__file__)[0]
@@ -70,14 +71,12 @@ class TurboMind:
         model_path (str): the path of turbomind's model
         data_type (str): the data type
         eos_id (int): eos token id
-        stop_words (List[int]): token ids of stop-words
     """
     def __init__(self,
                  model_path: str,
                  data_type: str = 'fp16',
-                 eos_id: int = 2,
-                 stop_words: List[int] = None):
+                 eos_id: int = 2):
         self.eos_id = eos_id
         # TODO: support mpi
@@ -101,6 +100,9 @@ class TurboMind:
             self.gpu_count = parser.getint(section_name,
                                            'tensor_para_size')
             self.session_len = parser.getint(section_name, 'session_len')
+            self.model_name = parser.get(section_name, 'model_name')
+            model = MODELS.get(self.model_name)()
+            self.stop_words = _stop_words(model.stop_words)
         # params
         self.node_id = node_id
@@ -129,8 +131,6 @@ class TurboMind:
         for t in threads:
             t.join()
-        self.stop_words = _stop_words(stop_words)
     def create_instance(self, cuda_stream_id=0):
         """Create a turbomind instance.
...
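This is the offline counterpart of the gRPC lookup in `chatbot.py`: `deploy.py` records `model_name` in the converted workspace's config, and `TurboMind` reads it back with `configparser`, then resolves the chat template via `MODELS.get(model_name)()` to obtain the stop words. A toy sketch of that read path; the section name, keys other than `model_name`/`session_len`, and all values below are made up for illustration.

```python
from configparser import ConfigParser

# Stand-in for the config.ini that deploy.py writes into ./workspace;
# the section name and numbers are illustrative only.
parser = ConfigParser()
parser.read_string("""
[llama]
tensor_para_size = 1
session_len = 2056
model_name = internlm-chat-7b
""")

section_name = parser.sections()[0]
session_len = parser.getint(section_name, 'session_len')
model_name = parser.get(section_name, 'model_name')
print(section_name, session_len, model_name)
```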