Commit ba9cb42a authored by zhouxiang's avatar zhouxiang

Add ChatGLM3 inference

# ChatGLM3
## Paper
`GLM: General Language Model Pretraining with Autoregressive Blank Infilling`
- [https://arxiv.org/abs/2103.10360](https://arxiv.org/abs/2103.10360)
## Model Architecture
ChatGLM3 is a new generation of dialogue pre-trained model jointly released by Zhipu AI and Tsinghua University's KEG Lab. ChatGLM3-6B is the open-source model in the ChatGLM3 series. While retaining many excellent features of the first two generations, such as fluent dialogue and a low deployment threshold, ChatGLM3-6B introduces a more powerful base model, more complete function support, and a more comprehensive set of open-source releases.
<div align="center">
<img src="doc/transformers.jpg" width="300" height="400">
</div>
The main network configuration of ChatGLM3-6B is as follows:
| Model | Hidden size | Layers | Heads | Vocab size | Positional encoding | Max sequence length |
| ----------- | ---------- | ---- | ---- | -------- | -------- | ------------ |
| ChatGLM3-6B | 4096 | 28 | 32 | 65024 | RoPE | 8192 |
## Algorithm
ChatGLM3-6B is built on the GLM architecture. GLM is a Transformer-based language model trained with an autoregressive blank-infilling objective, which gives it both autoregressive and autoencoding capabilities.
<div align="center">
<img src="doc/GLM.png" width="550" height="200">
</div>
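The blank-infilling objective can be illustrated with a toy example: a contiguous span of the input is replaced by a [MASK] token (Part A), and the model then generates the masked span autoregressively after a start token (Part B). A minimal sketch with made-up token strings — the token names and the single-span case are illustrative only; the real GLM tokenizer and its 2D positional encoding are more involved:

```python
# Illustrative sketch of GLM's autoregressive blank-infilling input layout
# (hypothetical token strings; the real model uses its own special tokens).
tokens = ["x1", "x2", "x3", "x4", "x5", "x6"]
span = (2, 4)  # mask the span x3..x4

# Part A: corrupted text with the span replaced by [MASK]
part_a = tokens[:span[0]] + ["[MASK]"] + tokens[span[1]:]
# Part B: the masked span, generated autoregressively after a start token
part_b = ["[sop]"] + tokens[span[0]:span[1]]
model_input = part_a + part_b
print(model_input)  # ['x1', 'x2', '[MASK]', 'x5', 'x6', '[sop]', 'x3', 'x4']
```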
This project focuses on optimizing ChatGLM3-6B inference performance to deliver fast dialogue on the DCU platform.
## Environment Setup
### Environment Preparation
The inference docker image can be pulled from the SourceFind (光源) registry:
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:lmdeploy-dtk23.10-torch1.13-py38
```
### Launching the Container
A reference command for starting the inference container (modify as needed):
```
# <container_name>: custom container name
# <project_path>: path to this project
docker run -it --name=<container_name> -v <project_path>:/work -v /opt/hyhal:/opt/hyhal --device=/dev/kfd --device=/dev/dri --security-opt seccomp=unconfined --cap-add=SYS_PTRACE --ipc=host --network host --shm-size=16G --group-add video image.sourcefind.cn:5000/dcu/admin/base/custom:lmdeploy-dtk23.10-torch1.13-py38 /bin/bash
```
### Installation
```
# From the root of this project:
cd package
python setup.py install
```
## Dataset
## Inference
### Downloading the Original Model
[THUDM/chatglm3-6b · Hugging Face](https://huggingface.co/THUDM/chatglm3-6b)
### Converting the Original ChatGLM3 Model
```
# Move the conversion script chatglm_export.py into the original ChatGLM3-6B environment,
# or install the dependencies listed in this project's requirements.txt with "pip3 install -r requirements.txt"
# If you use your own fine-tuned model, change the model paths used to create the tokenizer and model in chatglm_export.py
# Run:
python3 chatglm_export.py chatglm3-6b-fp16.bin float16 # export the fp16 model; the argument is the output path
python3 chatglm_export.py chatglm3-6b-int8.bin int8    # export the int8 model; the argument is the output path
```
### Running the ChatGLM3-6B Model
```
# Command-line chat demo showing model creation and streaming dialogue
python cli_demo.py -p chatglm3-6b-fp16.bin
# Simple web UI; install streamlit-chat first, and map the streamlit port to the host network when starting the container
streamlit run web_demo.py chatglm3-6b-fp16.bin
# An api_server implementing the OpenAI-style interface:
# First enter api_server_demo and install its dependencies:
cd api_server_demo
pip install -r requirements.txt
# Run the api_server; use -p to specify the converted model file. See openai-client.py for a sample client:
python fastllm-openai.py -p chatglm3-6b-fp16.bin
# To benchmark the service's concurrency, edit the prompt and concurrencys variables in openai-client.py and run:
python openai-client.py
```
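Once fastllm-openai.py is up, the service accepts requests in the OpenAI chat-completions wire format on port 8100. Below is a minimal standard-library sketch of building such a request; the host, port, and model name match the defaults in fastllm-openai.py, and the actual send is left commented out because it needs the running server:

```python
import json

# OpenAI-style chat completion request for the local fastllm api_server
payload = {
    "model": "chatglm3-6b-fastllm",   # must match model_list in fastllm-openai.py
    "messages": [{"role": "user", "content": "你好"}],
    "temperature": 0.1,
    "stream": False,
}
body = json.dumps(payload, ensure_ascii=False).encode("utf-8")

# To actually send it (requires the running server):
# import urllib.request
# req = urllib.request.Request("http://127.0.0.1:8100/v1/chat/completions",
#                              data=body, headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read().decode("utf-8"))
print(json.loads(body)["model"])
```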
## Result
![chatglm3-6b inference](doc/chatglm3-6b.gif)
### Accuracy
## Application Scenarios
### Algorithm Category
`Dialogue Q&A`
### Key Application Industries
`Healthcare, scientific research, finance, education`
## Source Repository & Issue Feedback
https://developer.hpccube.com/codes/modelzoo/chatglm3_fastllm
## References
https://github.com/THUDM/ChatGLM3
# coding=utf-8
# Implements an API for ChatGLM3-6B in OpenAI's format. (https://platform.openai.com/docs/api-reference/chat)
# Usage: python fastllm-openai.py -p <model_path>
# Visit http://localhost:8100/docs for the API documentation.
import time
import json
import torch
import uvicorn
import argparse
from pydantic import BaseModel, Field
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from contextlib import asynccontextmanager
from typing import Any, Dict, List, Literal, Optional, Union
#from transformers import AutoTokenizer, AutoModel
from sse_starlette.sse import ServerSentEvent, EventSourceResponse
from fastllm_pytools import llm
@asynccontextmanager
async def lifespan(app: FastAPI): # collects GPU memory
yield
global device_map
if torch.cuda.is_available():
for device in device_map:
with torch.cuda.device(device):
torch.cuda.empty_cache()
torch.cuda.ipc_collect()
app = FastAPI(lifespan=lifespan)
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
class ModelCard(BaseModel):
id: str
object: str = "model"
created: int = Field(default_factory=lambda: int(time.time()))
owned_by: str = "owner"
root: Optional[str] = None
parent: Optional[str] = None
permission: Optional[list] = None
class ModelList(BaseModel):
object: str = "list"
data: List[ModelCard] = []
class ChatMessage(BaseModel):
role: Literal["user", "assistant", "system"]
content: str
class Usage(BaseModel):
prompt_tokens: int = None
total_tokens: int = None
completion_tokens: int = None
class DeltaMessage(BaseModel):
role: Optional[Literal["user", "assistant", "system"]] = None
content: Optional[str] = None
class ChatCompletionRequest(BaseModel):
model: str
messages: List[ChatMessage]
temperature: Optional[float] = None
top_p: Optional[float] = None
max_length: Optional[int] = None
stream: Optional[bool] = False
class ChatCompletionResponseChoice(BaseModel):
index: int
message: ChatMessage
finish_reason: Literal["stop", "length"]
class ChatCompletionResponseStreamChoice(BaseModel):
index: int
delta: DeltaMessage
finish_reason: Optional[Literal["stop", "length"]]
class ChatCompletionResponse(BaseModel):
id: str
object: Literal["chat.completion", "chat.completion.chunk"]
created: Optional[int] = Field(default_factory=lambda: int(time.time()))
model: str
choices: List[Union[ChatCompletionResponseChoice, ChatCompletionResponseStreamChoice]]
usage: Usage = None
@app.get("/v1/models", response_model=ModelList)
def list_models():
    global model_list
    # Build one ModelCard per registered model name and return them in a ModelList
    return ModelList(data=[ModelCard(id=m) for m in model_list])
@app.post("/v1/chat/completions", response_model=ChatCompletionResponse)
def create_chat_completion(request: ChatCompletionRequest):
if request.model not in model_list:
raise HTTPException(status_code=400, detail="Invalid Model Name")
global model
id = "chatcmpl-A"
if request.messages[-1].role != "user":
raise HTTPException(status_code=400, detail="Invalid request")
query = request.messages[-1].content
if request.max_length is not None:
max_length = request.max_length
else:
max_length = 1024
if request.temperature is not None:
temperature = request.temperature
else:
temperature = 0.1
if request.top_p is not None:
top_p = request.top_p
else:
top_p = 0.8
prev_messages = request.messages[:-1]
# print(prev_messages)
if len(prev_messages) > 0 and prev_messages[0].role == "system":
query = prev_messages.pop(0).content + query
history = []
if len(prev_messages) % 2 == 0:
for i in range(0, len(prev_messages), 2):
if prev_messages[i].role == "user" and prev_messages[i+1].role == "assistant":
history.append([prev_messages[i].content, prev_messages[i+1].content])
if request.stream:
generate = predict(id=id, query=query, history=history, max_length=max_length, top_p = top_p, temperature = temperature, model_id = request.model)
return EventSourceResponse(generate, media_type="text/event-stream")
response = model.response(query=query, history=history, max_length=max_length, top_p = top_p, temperature = temperature)
choice_data = ChatCompletionResponseChoice(
index=0,
message=ChatMessage(role="assistant", content=response),
finish_reason="stop"
)
prompt_tokens = len(model.tokenizer_encode_string(query))
completion_tokens = len(model.tokenizer_encode_string(response))
usage = Usage(
prompt_tokens = prompt_tokens,
completion_tokens = completion_tokens,
total_tokens = prompt_tokens+completion_tokens,
)
return ChatCompletionResponse(id=id ,model=request.model, choices=[choice_data], object="chat.completion", usage=usage)
def predict(id: str, query: str, history: List[List[str]], model_id: str, max_length: int, top_p: float, temperature: float):
    global model
    create_time = int(time.time())
    choice_data = ChatCompletionResponseStreamChoice(
        index=0,
        delta=DeltaMessage(role="assistant"),
        finish_reason=None
    )
    chunk = ChatCompletionResponse(id=id, created=create_time, model=model_id, choices=[choice_data], object="chat.completion.chunk")
    # pydantic no longer supports dumps_kwargs such as ensure_ascii in .json() from 1.8.0 on,
    # so serialize via model_dump + json.dumps; see https://github.com/THUDM/ChatGLM2-6B/issues/308
    yield json.dumps(chunk.model_dump(exclude_unset=True), ensure_ascii=False)
    for new_response in model.stream_response(query=query, history=history, max_length=max_length, top_p=top_p, temperature=temperature):
        choice_data = ChatCompletionResponseStreamChoice(
            index=0,
            delta=DeltaMessage(content=new_response),
            finish_reason=None
        )
        chunk = ChatCompletionResponse(id=id, created=create_time, model=model_id, choices=[choice_data], object="chat.completion.chunk")
        yield json.dumps(chunk.model_dump(exclude_unset=True), ensure_ascii=False)
    choice_data = ChatCompletionResponseStreamChoice(
        index=0,
        delta=DeltaMessage(),
        finish_reason="stop"
    )
    chunk = ChatCompletionResponse(id=id, created=create_time, model=model_id, choices=[choice_data], object="chat.completion.chunk")
    yield json.dumps(chunk.model_dump(exclude_unset=True), ensure_ascii=False)
    yield '[DONE]'
def args_parser():
    parser = argparse.ArgumentParser(description='chatglm3_api_server_demo')
    parser.add_argument('-p', '--path', type=str, default="/model", help='path to the model file')
    parser.add_argument('-g', '--gpus', type=str, default="0", help='GPUs to run on, e.g. "0,1"')
    args = parser.parse_args()
    return args
if __name__ == "__main__":
args = args_parser()
global model_list
model_list = ["chatglm3-6b-fastllm"]
global device_map
device_map = ["cuda:"+num for num in args.gpus.split(',')]
llm.set_device_map(device_map)
model = llm.model(args.path)
uvicorn.run(app, host='127.0.0.1', port=8100)
import openai
import time
import threading
import queue
from concurrent.futures import ThreadPoolExecutor, as_completed
def jls_extract_def(model, messages, temperature, max_length, stream, index):
    # Send one chat completion request and consume the (streaming) response,
    # counting output tokens to measure throughput
    openai.api_base = "http://127.0.0.1:8100/v1"
    openai.api_key = "none"
    output_tokens = 0
    ret = ""
    t0 = time.time()
    result = openai.ChatCompletion.create(model=model, messages=messages, temperature=temperature, max_length=max_length, stream=stream)
for chunk in result:
# print(chunk)
output_tokens += 1
if hasattr(chunk.choices[0].delta, "content"):
if (index == 0):
print(chunk.choices[0].delta.content, end="", flush=True)
ret += chunk.choices[0].delta.content
t1 = time.time()
# print("\ntoken/s: {:.2f}, output_tokens: {}".format(output_tokens/(t1-t0),output_tokens))
result = output_tokens, ret, output_tokens/(t1-t0)
return result
if __name__ == "__main__":
prompt = "满江红全文"
concurrencys = [1]
temperature = 0.1
max_length = 4096
stream = True
prompts = [prompt]
model="chatglm3-6b-fastllm"
messages=[{"role": "user", "content": "你好"}]
pool = ThreadPoolExecutor(max_workers=32)
for i in range(len(concurrencys)):
cur_prompts = prompts * concurrencys[i]
token_count = 0
threads = []
t0 = time.time()
for index, prompt in enumerate(cur_prompts):
messages[0]["content"] = prompt
t = pool.submit(jls_extract_def, model, messages, temperature, max_length, stream, index)
t.index = index
threads.append(t)
for future in as_completed(threads):
result = future.result()
print(future.index)
print(result)
print("\n")
token_count += result[0]
t1 = time.time()
print("\n---------------------------------------------\n")
print("\nconcurrency: {}".format(concurrencys[i]))
print("\ntotal use: {:.2f}".format(t1-t0))
print("\ntoken/s: {:.2f}, token_count: {}".format(token_count/(t1-t0),token_count))
print("\n---------------------------------------------\n")
uvicorn==0.23.2
pydantic==2.5.1
fastapi==0.103.1
sse_starlette
import sys
from transformers import AutoTokenizer, AutoModel
from fastllm_pytools import torch2flm
if __name__ == "__main__":
model_path = "THUDM/chatglm3-6b"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
model = model.eval()
dtype = sys.argv[2] if len(sys.argv) >= 3 else "float16"
    exportPath = sys.argv[1] if len(sys.argv) >= 2 else "chatglm3-6b-" + dtype + ".flm"
torch2flm.tofile(exportPath, model, tokenizer, dtype = dtype)
# coding=utf-8
import argparse
from fastllm_pytools import llm
import time
def args_parser():
parser = argparse.ArgumentParser(description = 'fastllm_chat_demo')
    parser.add_argument('-p', '--path', type = str, required = True, help = 'path to the model file')
args = parser.parse_args()
return args
if __name__ == "__main__":
args = args_parser()
model = llm.model(args.path)
history = []
print("输入内容即可进行对话,clear 清空对话历史,stop 终止程序")
while True:
query = input("\n用户:")
if query.strip() == "stop":
break
if query.strip() == "clear":
history = []
print("输入内容即可进行对话,clear 清空对话历史,stop 终止程序")
continue
print("AI:", end = "")
curResponse = ""
prompt = model.get_prompt(query, history)
tokens = model.tokenizer_encode_string(prompt)
token_input_count = len(tokens)
print("token_input_count", token_input_count)
token_count = 0
t0 = time.time()
for response in model.stream_response(query, history = history, do_sample = True, top_p = 0.8, top_k = 1, temperature = 1.0, repeat_penalty = 1.01):
curResponse += response
print(response, flush = True, end = "")
token_count += 1
t1 = time.time()
word_len = len(curResponse)
print("\ntoken/s: {:.2f}, character/s: {:.2f}".format(token_count/(t1-t0), word_len/(t1-t0)))
history.append((query, curResponse))
# Unique model identifier
modelCode = 514
# Model name
modelName=chatglm3_fastllm
# Model description
modelDescription=ChatGLM3 是智谱AI与清华大学KEG实验室联合发布的新一代对话预训练模型
# Application scenario
appScenario=推理,对话问答,医疗,科研,金融,教育
# Framework type
frameType=fastllm
__all__ = ["llm"]
from fastllm_pytools import llm
import torch
import ctypes
import numpy as np
fastllm_data_type_dict = {
"int4": 8,
"int8": 3,
"float16": 7
}
fastllm_weight_type_dict = {
"linear": 1,
"embedding": 2,
"QuantizedLinear": 111
}
def create(model,
tokenizer = None,
pre_prompt = None,
user_role = None,
bot_role = None,
history_sep = None,
dtype = "float16"):
    if (dtype not in fastllm_data_type_dict):
        print("dtype should be one of ", list(fastllm_data_type_dict.keys()))
        exit(0)
# 0.1 model info
# if model.config.model_type == "chatglm" and model.config.transformers_version == "4.30.2":
# model.config.model_type = "chatglm3"
# print("model.config.model_type: chatglm3!")
modelInfo = model.config.__dict__
if model.generation_config is not None:
modelInfo.update(model.generation_config.__dict__)
if (pre_prompt):
modelInfo["pre_prompt"] = pre_prompt;
if (user_role):
modelInfo["user_role"] = user_role;
if (bot_role):
modelInfo["bot_role"] = bot_role;
if (history_sep):
modelInfo["history_sep"] = history_sep;
if (modelInfo["model_type"] == "baichuan" and hasattr(model, "model") and hasattr(model.model, "get_alibi_mask")):
        # Baichuan 2
modelInfo["use_alibi"] = "1";
modelInfo["pre_prompt"] = "";
modelInfo["user_role"] = ("<FLM_FIX_TOKEN_" + str(model.generation_config.user_token_id) + "> ") if hasattr(model.generation_config, "user_token_id") else "";
modelInfo["bot_role"] = ("<FLM_FIX_TOKEN_" + str(model.generation_config.assistant_token_id) + ">") if hasattr(model.generation_config, "assistant_token_id") else "";
modelInfo["history_sep"] = "";
if (modelInfo["model_type"] == "qwen"):
if modelInfo["chat_format"] == "chatml":
modelInfo["im_end_id"] = tokenizer.im_end_id
modelInfo["im_start_id"] = tokenizer.im_start_id
if (modelInfo["model_type"] == "chatglm" and hasattr(tokenizer, "build_chat_input")):
# chatglm3
modelInfo["pre_prompt"] = "";
modelInfo["user_role"] = ("<FLM_FIX_TOKEN_" + str(tokenizer.get_command("<|user|>")) + ">\n");
modelInfo["bot_role"] = ("<FLM_FIX_TOKEN_" + str(tokenizer.get_command("<|assistant|>")) + ">");
modelInfo["history_sep"] = "";
weight_type_dict = {};
module_dict = {};
weight_bits = {};
for key, m in model.named_modules():
if (str(type(m)).find("QuantizedLinear") != -1):
weight_type_dict[key + ".weight"] = "QuantizedLinear";
weight_bits[key + ".weight"] = m.weight_bit_width;
if (isinstance(m, torch.nn.Linear)):
weight_type_dict[key + ".weight"] = "linear";
module_dict[key + ".weight"] = m;
if (isinstance(m, torch.nn.Embedding)):
weight_type_dict[key] = "embedding";
peft_config = {}
active_adapter = ""
if hasattr(model, "peft_config"):
peft_config = model.peft_config
if hasattr(model, "active_adapter") and isinstance(model.active_adapter, str):
        # in transformers >= 4.33.0, active_adapter is a function on the model; ignore it for now
active_adapter = model.active_adapter
model = model.cpu();
dict = model.state_dict();
model_type = model.config.__dict__["model_type"];
model = llm.fastllm_lib.create_empty_llm_model(model_type.encode());
for it in modelInfo.keys():
llm.fastllm_lib.add_dict_llm_model(model, str(it).encode(), str(modelInfo[it]).encode());
for adapter_name in peft_config.keys():
adapter_dict = peft_config[adapter_name].__dict__
for it in adapter_dict.keys():
llm.fastllm_lib.add_adapter_dict_llm_model(model, str(adapter_name).encode(), str(it).encode(), str(adapter_dict[it]).encode())
if len(active_adapter) != 0:
llm.fastllm_lib.set_adapter(model, str(active_adapter).encode())
# 1. vocab
if (tokenizer):
if (hasattr(tokenizer, "tokenizer")):
if modelInfo["model_type"] == "qwen":
pass
else:
tokenizer = tokenizer.tokenizer;
if (hasattr(tokenizer, "sp_model")):
piece_size = tokenizer.sp_model.piece_size();
for i in range(piece_size):
llm.fastllm_lib.add_tokenizer_word_llm_model(model, tokenizer.sp_model.id_to_piece(i).encode(),
i, ctypes.c_float(tokenizer.sp_model.get_score(i)));
else:
vocab = tokenizer.get_vocab();
for v in vocab.keys():
if (modelInfo["model_type"] == "moss"):
vv = [(ord(c) if c not in tokenizer.byte_decoder else tokenizer.byte_decoder[c]) for c in v];
llm.fastllm_lib.add_tokenizer_word_llm_model(model, vv, vocab[v], ctypes.c_float(1.0));
elif (modelInfo["model_type"] == "qwen"):
llm.fastllm_lib.add_tokenizer_word_llm_model(model, v, vocab[v], ctypes.c_float(1.0));
else:
llm.fastllm_lib.add_tokenizer_word_llm_model(model, v.encode(), vocab[v], ctypes.c_float(1.0));
tot = 0;
for key in dict:
ori_data_type = 0;
ori_np_data_type = np.float32;
cur_weight_type = 0;
if (key in weight_type_dict and weight_type_dict[key] in fastllm_weight_type_dict):
cur_weight_type = fastllm_weight_type_dict[weight_type_dict[key]];
to_data_type = 0;
if (cur_weight_type == 1):
to_data_type = fastllm_data_type_dict[dtype];
if (to_data_type == 7):
ori_data_type = 7;
ori_np_data_type = np.float16;
elif (cur_weight_type == 2):
# TODO bfloat
to_data_type = 0;
weight_name = key
if peft_config is not None:
weight_name = weight_name.replace('base_model.model.', '')
if (cur_weight_type == 111):
llm.fastllm_lib.add_qlinear_weight_llm_model(model, weight_name.encode(),
len(dict[key].shape),
(ctypes.c_int * len(dict[key].shape))(*list(dict[key].shape)),
weight_bits[key],
dict[key + "_scale"].numpy().astype(np.float32).ctypes.data_as(ctypes.c_void_p),
dict[key].numpy().ctypes.data_as(ctypes.c_void_p));
else:
llm.fastllm_lib.add_weight_llm_model(model, weight_name.encode(),
len(dict[key].shape),
(ctypes.c_int * len(dict[key].shape))(*list(dict[key].shape)),
to_data_type, cur_weight_type, ori_data_type,
dict[key].numpy().astype(ori_np_data_type).ctypes.data_as(ctypes.c_void_p));
tot += 1;
print("convert (", tot, "/", len(dict), end = " )\r");
print("");
llm.fastllm_lib.init_params_llm_model(model);
llm.fastllm_lib.warmup_llm_model(model);
ret = llm.model("", id = model);
return ret;
import struct
import numpy as np
import torch
def writeString(fo, s):
fo.write(struct.pack('i', len(s)))
fo.write(s.encode())
def writeKeyValue(fo, key, value):
writeString(fo, key)
writeString(fo, value)
fastllm_data_type_dict = {
"int4": 8,
"int8": 3,
"float16": 7,
"float32": 0,
}
fastllm_weight_type_dict = {
"linear": 1,
"embedding": 2
}
def write_int8(fo, v):
c_max = np.expand_dims(np.abs(v).max(axis = -1), -1).clip(0.1, 1e100)
c_scale = c_max / 127.0
v = (v / c_scale + 128.5).clip(1, 255).astype(np.uint8)
fo.write(struct.pack('i', 3))
fo.write(struct.pack('i', 0))
for i in range(c_max.shape[0]):
fo.write(struct.pack('f', -c_max[i][0]));
fo.write(struct.pack('f', c_max[i][0]));
fo.write(v.data)
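As a sanity check, the int8 scheme in write_int8 can be round-tripped in isolation: each row is scaled so that ±c_max maps onto the uint8 range around an offset of 128. The dequantization step below is an assumption that mirrors the writer (interpreting the stored per-row -c_max/c_max pair linearly); it is not the actual fastllm loader:

```python
import numpy as np

# Round-trip sketch of the symmetric int8 scheme used in write_int8
v = np.array([[0.5, -1.0, 0.25, 1.0]], dtype=np.float32)
c_max = np.expand_dims(np.abs(v).max(axis=-1), -1).clip(0.1, 1e100)
c_scale = c_max / 127.0
q = (v / c_scale + 128.5).clip(1, 255).astype(np.uint8)
# Dequantize: undo the 128 offset and rescale (assumed inverse of the writer)
v_hat = (q.astype(np.float32) - 128.0) * c_scale
err = np.abs(v - v_hat).max()
print(err)  # per-element error is bounded by roughly c_scale / 2
```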
def write_int4(fo, v):
# c_min = np.expand_dims(-np.abs(v).max(axis = -1), -1)
# c_max = np.expand_dims(np.abs(v).max(axis = -1), -1)
# c_scale = c_max / 7.0
# c_min = c_scale * -8.0
c_min = np.expand_dims(v.min(axis = -1), -1)
c_max = np.expand_dims(v.max(axis = -1), -1)
c_scale = (c_max - c_min) / 15.0
c_zero = np.round(0.0 - c_min / c_scale)
c_zero = c_zero.clip(0, 15)
c_min = -c_scale * c_zero
v = (v - c_min) / c_scale
v = (v + 0.5).astype(np.int8).clip(0, 15).astype(np.uint8)
v = v[:, 0::2] * 16 + v[:, 1::2]
fo.write(struct.pack('i', 8))
fo.write(struct.pack('i', 0))
for i in range(c_min.shape[0]):
fo.write(struct.pack('f', c_min[i][0]));
fo.write(struct.pack('f', c_max[i][0]));
fo.write(v.data)
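write_int4 uses an asymmetric per-row scheme: the [c_min, c_max] range is mapped to 4-bit codes through a zero point, and two codes are packed per byte. A round-trip sketch follows; the unpack/dequantize inverse is an assumption mirroring the writer, not the fastllm loader:

```python
import numpy as np

# Round-trip sketch of the asymmetric int4 packing in write_int4
v = np.array([[-1.0, -0.5, 0.0, 0.5]], dtype=np.float32)
c_min = np.expand_dims(v.min(axis=-1), -1)
c_max = np.expand_dims(v.max(axis=-1), -1)
c_scale = (c_max - c_min) / 15.0
c_zero = np.round(0.0 - c_min / c_scale).clip(0, 15)
c_min = -c_scale * c_zero                      # snap c_min to the zero point
q = ((v - c_min) / c_scale + 0.5).astype(np.int8).clip(0, 15).astype(np.uint8)
packed = q[:, 0::2] * 16 + q[:, 1::2]          # two 4-bit codes per byte
# Unpack and dequantize (assumed inverse of the writer above)
unpacked = np.stack([packed // 16, packed % 16], axis=-1).reshape(v.shape)
v_hat = unpacked * c_scale + c_min
err = np.abs(v - v_hat).max()
print(err)  # bounded by roughly c_scale / 2
```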
def tofile(exportPath,
model,
tokenizer = None,
pre_prompt = None,
user_role = None,
bot_role = None,
history_sep = None,
dtype = "float16"):
    if (dtype not in fastllm_data_type_dict):
        print("dtype should be one of ", list(fastllm_data_type_dict.keys()))
        exit(0)
dict = model.state_dict()
fo = open(exportPath, "wb")
# 0. version id
fo.write(struct.pack('i', 2))
# 0.1 model info
#if model.config.model_type == "chatglm" and model.config.transformers_version == "4.30.2":
# model.config.model_type = "chatglm3"
modelInfo = model.config.__dict__
if model.generation_config is not None:
modelInfo.update(model.generation_config.__dict__)
if ("model_type" not in modelInfo):
print("unknown model_type.")
exit(0)
if (pre_prompt):
modelInfo["pre_prompt"] = pre_prompt
if (user_role):
modelInfo["user_role"] = user_role
if (bot_role):
modelInfo["bot_role"] = bot_role
if (history_sep):
modelInfo["history_sep"] = history_sep
if (modelInfo["model_type"] == "baichuan" and hasattr(model, "model") and hasattr(model.model, "get_alibi_mask")):
        # Baichuan 2
modelInfo["use_alibi"] = "1"
modelInfo["pre_prompt"] = ""
modelInfo["user_role"] = ("<FLM_FIX_TOKEN_" + str(model.generation_config.user_token_id) + ">") if hasattr(model.generation_config, "user_token_id") else "";
modelInfo["bot_role"] = ("<FLM_FIX_TOKEN_" + str(model.generation_config.assistant_token_id) + ">") if hasattr(model.generation_config, "assistant_token_id") else "";
modelInfo["history_sep"] = ""
if (modelInfo["model_type"] == "baichuan" and modelInfo["vocab_size"] == 125696):
        # Baichuan 2 7B
modelInfo["pre_prompt"] = ""
modelInfo["user_role"] = ("<FLM_FIX_TOKEN_" + str(model.generation_config.user_token_id) + ">") if hasattr(model.generation_config, "user_token_id") else "";
modelInfo["bot_role"] = ("<FLM_FIX_TOKEN_" + str(model.generation_config.assistant_token_id) + ">") if hasattr(model.generation_config, "assistant_token_id") else "";
modelInfo["history_sep"] = ""
if modelInfo["model_type"] == "qwen":
if modelInfo["chat_format"] == "chatml":
modelInfo["im_end_id"] = tokenizer.im_end_id
modelInfo["im_start_id"] = tokenizer.im_start_id
if (modelInfo["model_type"] == "chatglm" and hasattr(tokenizer, "build_chat_input")):
print("chatglm3")
# chatglm3
modelInfo["pre_prompt"] = "";
modelInfo["user_role"] = ("<FLM_FIX_TOKEN_" + str(tokenizer.get_command("<|user|>")) + ">\n");
modelInfo["bot_role"] = ("<FLM_FIX_TOKEN_" + str(tokenizer.get_command("<|assistant|>")) + ">");
modelInfo["history_sep"] = "";
    modelInfo["tokenizer_use_score"] = "1" # tokenizer entries carry scores
if hasattr(model, "peft_config"):
adapter_size = len(model.peft_config)
modelInfo["peft_size"] = adapter_size
fo.write(struct.pack('i', len(modelInfo)))
for it in modelInfo.keys():
writeKeyValue(fo, str(it), str(modelInfo[it]))
if hasattr(model, "peft_config"):
for adapter_name in model.peft_config.keys():
adapter_dict = model.peft_config[adapter_name].__dict__
writeString(fo, adapter_name)
fo.write(struct.pack('i', len(adapter_dict)))
for it in adapter_dict.keys():
writeKeyValue(fo, str(it), str(adapter_dict[it]))
# 1. vocab
if (tokenizer):
if (hasattr(tokenizer, "tokenizer")):
if (modelInfo['model_type'] == "qwen"):
pass
else:
tokenizer = tokenizer.tokenizer
if (hasattr(tokenizer, "sp_model")):
piece_size = tokenizer.sp_model.piece_size()
fo.write(struct.pack('i', piece_size))
for i in range(piece_size):
s = tokenizer.sp_model.id_to_piece(i).encode()
fo.write(struct.pack('i', len(s)))
for c in s:
fo.write(struct.pack('i', c))
fo.write(struct.pack('i', i))
fo.write(struct.pack('f', float(tokenizer.sp_model.get_score(i))))
else:
vocab = tokenizer.get_vocab()
fo.write(struct.pack('i', len(vocab)))
for v in vocab.keys():
if (modelInfo['model_type'] == "qwen"):
s = v
elif (modelInfo["model_type"] == "moss"):
s = [(ord(c) if c not in tokenizer.byte_decoder else tokenizer.byte_decoder[c]) for c in v]
else:
s = v.encode()
fo.write(struct.pack('i', len(s)))
for c in s:
fo.write(struct.pack('i', c))
fo.write(struct.pack('i', vocab[v]))
fo.write(struct.pack('f', 1.0))
else:
fo.write(struct.pack('i', 0))
weight_type_dict = {}
module_dict = {}
for key, m in model.named_modules():
if (isinstance(m, torch.nn.Linear)):
weight_type_dict[key + ".weight"] = "linear"
module_dict[key + ".weight"] = m
if (isinstance(m, torch.nn.Embedding)):
weight_type_dict[key] = "embedding"
# 2. weight
fo.write(struct.pack('i', len(dict)))
tot = 0
for key in dict:
ori_data_type = 0
ori_np_data_type = np.float32
cur_weight_type = 0
if (key in weight_type_dict and weight_type_dict[key] in fastllm_weight_type_dict):
cur_weight_type = fastllm_weight_type_dict[weight_type_dict[key]]
to_data_type = 0
if (cur_weight_type == 1):
to_data_type = fastllm_data_type_dict[dtype]
if (to_data_type == 7):
ori_data_type = 7
ori_np_data_type = np.float16
cur = dict[key].numpy().astype(ori_np_data_type)
if hasattr(model, "peft_config"):
weight_name = key.replace('base_model.model.', '')
fo.write(struct.pack('i', len(weight_name)))
fo.write(weight_name.encode())
else:
fo.write(struct.pack('i', len(key)))
fo.write(key.encode())
fo.write(struct.pack('i', len(cur.shape)))
for i in cur.shape:
fo.write(struct.pack('i', i))
if (to_data_type == 3):
write_int8(fo, cur)
elif (to_data_type == 8):
write_int4(fo, cur)
else:
fo.write(struct.pack('i', to_data_type))
fo.write(cur.data)
tot += 1
print("output (", tot, "/", len(dict), end = " )\r")
print("\nfinish.")
fo.close()
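The .flm layout written by tofile starts with a version id, followed by the count of modelInfo entries, each serialized by writeKeyValue. As a sketch, the header can be read back with the inverse of writeString; this toy reader is inferred from the writer code above, not taken from the actual fastllm loader:

```python
import io
import struct

def read_string(f):
    # Inverse of writeString: a 4-byte length followed by that many bytes
    (n,) = struct.unpack('i', f.read(4))
    return f.read(n).decode()

# Write a minimal header the same way tofile does
buf = io.BytesIO()
buf.write(struct.pack('i', 2))                 # 0. version id
info = {"model_type": "chatglm"}
buf.write(struct.pack('i', len(info)))         # number of modelInfo entries
for k, v in info.items():
    buf.write(struct.pack('i', len(k))); buf.write(k.encode())
    buf.write(struct.pack('i', len(v))); buf.write(v.encode())

# Read it back
buf.seek(0)
(version,) = struct.unpack('i', buf.read(4))
(count,) = struct.unpack('i', buf.read(4))
parsed = {}
for _ in range(count):
    key = read_string(buf)
    parsed[key] = read_string(buf)
print(version, parsed)
```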
from setuptools import setup, find_packages
setup (
name = "fastllm_pytools",
version = "0.0.1",
description = "Fastllm pytools",
packages = ['fastllm_pytools'],
url = "https://developer.hpccube.com/codes/aicomponent/fastllm",
package_data = {
'': ['*.dll', '*.so']
}
)
import streamlit as st
from streamlit_chat import message
from fastllm_pytools import llm
import sys
st.set_page_config(
page_title="fastllm web demo",
page_icon=":robot:"
)
@st.cache_resource
def get_model():
model = llm.model(sys.argv[1])
return model
if "messages" not in st.session_state:
st.session_state.messages = []
for i, (prompt, response) in enumerate(st.session_state.messages):
with st.chat_message("user"):
st.markdown(prompt)
with st.chat_message("assistant"):
st.markdown(response)
if prompt := st.chat_input("请开始对话"):
model = get_model()
with st.chat_message("user"):
st.markdown(prompt)
with st.chat_message("assistant"):
message_placeholder = st.empty()
full_response = ""
for chunk in model.stream_response(prompt, st.session_state.messages, one_by_one = True):
full_response += chunk
message_placeholder.markdown(full_response + "▌")
message_placeholder.markdown(full_response)
st.session_state.messages.append((prompt, full_response))