Commit ba9cb42a authored by zhouxiang

Commit ChatGLM3 inference
# ChatGLM3
## Paper
`GLM: General Language Model Pretraining with Autoregressive Blank Infilling`
- [https://arxiv.org/abs/2103.10360](https://arxiv.org/abs/2103.10360)
## Model Architecture
ChatGLM3 is a new-generation conversational pre-trained model jointly released by Zhipu AI and the KEG Lab of Tsinghua University. ChatGLM3-6B is the open-source model in the ChatGLM3 series; while keeping the strengths of the previous two generations, such as fluent conversation and a low deployment threshold, it adds a stronger base model, more complete feature support, and a more comprehensive set of open-source releases.
<div align="center">
<img src="doc/transformers.jpg" width="300" height="400">
</div>
The main network configuration of ChatGLM3-6B is listed below:
| Model | Hidden size | Layers | Heads | Vocab size | Positional encoding | Max sequence length |
| ----------- | ---------- | ---- | ---- | -------- | -------- | ------------ |
| ChatGLM3-6B | 4096 | 28 | 32 | 65024 | RoPE | 8192 |
## Algorithm
ChatGLM3-6B is built on the GLM architecture. GLM is a Transformer-based language model trained with an autoregressive blank-infilling objective, which gives it both autoregressive and autoencoding capabilities.
<div align="center">
<img src="doc/GLM.png" width="550" height="200">
</div>
This project focuses on optimizing ChatGLM3-6B inference to achieve fast conversational performance on the DCU platform.
## Environment Setup
### Environment Preparation
The inference Docker image can be pulled from the SourceFind registry as follows:
```
docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:lmdeploy-dtk23.10-torch1.13-py38
```
### Starting the Container
A reference command for starting the inference container is given below; adjust it as needed:
```
# <container_name>: a custom container name
# <project_path>: the path to this project
docker run -it --name=<container_name> -v <project_path>:/work -v /opt/hyhal:/opt/hyhal --device=/dev/kfd --device=/dev/dri --security-opt seccomp=unconfined --cap-add=SYS_PTRACE --ipc=host --network host --shm-size=16G --group-add video image.sourcefind.cn:5000/dcu/admin/base/custom:lmdeploy-dtk23.10-torch1.13-py38 /bin/bash
```
### Installation
```
# run from the root directory of this project
cd package
python setup.py install
```
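A quick way to check that the package was installed correctly is to import it from Python. The sketch below only verifies that `fastllm_pytools` and its bundled native library can be loaded:
```
# Sketch: verify that fastllm_pytools and its native fastllm library load correctly.
from fastllm_pytools import llm

print("fastllm CPU threads:", llm.get_cpu_threads())
```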
## Dataset
## Inference
### Downloading the Original Model
Download the original model from [THUDM/chatglm3-6b · Hugging Face](https://huggingface.co/THUDM/chatglm3-6b).
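If you prefer to download the weights from a script, the sketch below uses the `huggingface_hub` package (an extra dependency, not listed in this project's requirements; the target directory is only an example):
```
# Optional sketch: fetch the original ChatGLM3-6B weights with huggingface_hub.
# Assumes `pip install huggingface_hub`; ./chatglm3-6b is an arbitrary example path.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="THUDM/chatglm3-6b", local_dir="./chatglm3-6b")
```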
### Converting the Original ChatGLM3 Model
```
# Move the conversion script chatglm_export.py into the original ChatGLM3-6B environment, or install the required dependencies from the project's bundled requirements.txt with "pip3 install -r requirements.txt"
# If you use your own finetuned model, change the model path used when creating the tokenizer and model in chatglm_export.py
# Run:
python3 chatglm_export.py chatglm3-6b-fp16.bin float16 # export an fp16 model; the first argument is the output model path
python3 chatglm_export.py chatglm3-6b-int8.bin int8 # export an int8 model; the first argument is the output model path
```
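As an alternative to the standalone export script, the bundled fastllm_pytools package (see llm.py in this commit) also exposes `llm.from_hf`, which converts a Hugging Face model that is already loaded in memory. A minimal sketch, assuming the original chatglm3-6b weights are available locally:
```
# Sketch: convert a loaded Hugging Face ChatGLM3 model in memory via fastllm_pytools.
from transformers import AutoTokenizer, AutoModel
from fastllm_pytools import llm

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
hf_model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).eval()

flm_model = llm.from_hf(hf_model, tokenizer, dtype="float16")  # in-memory fastllm model
flm_model.save("chatglm3-6b-fp16.flm")                         # optionally persist to disk
```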
### Running the ChatGLM3-6B Model
```
# Command-line chat demo, showing model creation and streaming conversation
python cli_demo.py -p chatglm3-6b-fp16.bin
# Simple web UI; install streamlit-chat first, and map the Streamlit port out of the container when starting it
streamlit run web_demo.py chatglm3-6b-fp16.bin
# Example api_server implementing the OpenAI-style API:
# First enter api_server_demo and install its dependencies:
cd api_server_demo
pip install -r requirements.txt
# Start the api_server; use -p to specify the converted model file. For a client implementation, see openai-client.py:
python fastllm-openai.py -p chatglm3-6b-fp16.bin
# To test the concurrency of the service, edit the prompt and concurrencys variables in openai-client.py and then run:
python openai-client.py
```
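For reference, the api_server started above listens on 127.0.0.1:8100 (see fastllm-openai.py), so a non-streaming request can also be sent with plain `requests`. This is a sketch and assumes the server is already running:
```
# Sketch: call the OpenAI-style /v1/chat/completions endpoint of fastllm-openai.py.
# Assumes the server is running on 127.0.0.1:8100 with the "chatglm3-6b-fastllm" model.
import requests

payload = {
    "model": "chatglm3-6b-fastllm",
    "messages": [{"role": "user", "content": "你好"}],
    "temperature": 0.1,
    "stream": False,
}
resp = requests.post("http://127.0.0.1:8100/v1/chat/completions", json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```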
## Result
![chatglm3-6b inference](doc/chatglm3-6b.gif)
### Accuracy
## Application Scenarios
### Algorithm Category
`Conversational QA`
### Key Application Industries
`Healthcare, scientific research, finance, education`
## Source Repository and Issue Reporting
https://developer.hpccube.com/codes/modelzoo/chatglm3_fastllm
## References
https://github.com/THUDM/ChatGLM3
# coding=utf-8
# Implements API for ChatGLM3-6B in OpenAI's format. (https://platform.openai.com/docs/api-reference/chat)
# Usage: python fastllm-openai.py -p <converted_model_path>
# Visit http://localhost:8100/docs for the interactive API docs.
import time
import json
import torch
import uvicorn
import argparse
from pydantic import BaseModel, Field
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from contextlib import asynccontextmanager
from typing import Any, Dict, List, Literal, Optional, Union
#from transformers import AutoTokenizer, AutoModel
from sse_starlette.sse import ServerSentEvent, EventSourceResponse
from fastllm_pytools import llm
@asynccontextmanager
async def lifespan(app: FastAPI): # collects GPU memory
yield
global device_map
if torch.cuda.is_available():
for device in device_map:
with torch.cuda.device(device):
torch.cuda.empty_cache()
torch.cuda.ipc_collect()
app = FastAPI(lifespan=lifespan)
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
class ModelCard(BaseModel):
id: str
object: str = "model"
created: int = Field(default_factory=lambda: int(time.time()))
owned_by: str = "owner"
root: Optional[str] = None
parent: Optional[str] = None
permission: Optional[list] = None
class ModelList(BaseModel):
object: str = "list"
data: List[ModelCard] = []
class ChatMessage(BaseModel):
role: Literal["user", "assistant", "system"]
content: str
class Usage(BaseModel):
    prompt_tokens: Optional[int] = None
    total_tokens: Optional[int] = None
    completion_tokens: Optional[int] = None
class DeltaMessage(BaseModel):
role: Optional[Literal["user", "assistant", "system"]] = None
content: Optional[str] = None
class ChatCompletionRequest(BaseModel):
model: str
messages: List[ChatMessage]
temperature: Optional[float] = None
top_p: Optional[float] = None
max_length: Optional[int] = None
stream: Optional[bool] = False
class ChatCompletionResponseChoice(BaseModel):
index: int
message: ChatMessage
finish_reason: Literal["stop", "length"]
class ChatCompletionResponseStreamChoice(BaseModel):
index: int
delta: DeltaMessage
finish_reason: Optional[Literal["stop", "length"]]
class ChatCompletionResponse(BaseModel):
id: str
object: Literal["chat.completion", "chat.completion.chunk"]
created: Optional[int] = Field(default_factory=lambda: int(time.time()))
model: str
choices: List[Union[ChatCompletionResponseChoice, ChatCompletionResponseStreamChoice]]
    usage: Optional[Usage] = None
@app.get("/v1/models", response_model=ModelList)
def list_models():
global model_list
for model in model_list:
ModelCard(id=model)
ModelList.data.append(ModelCard)
return ModelList()
@app.post("/v1/chat/completions", response_model=ChatCompletionResponse)
def create_chat_completion(request: ChatCompletionRequest):
if request.model not in model_list:
raise HTTPException(status_code=400, detail="Invalid Model Name")
global model
id = "chatcmpl-A"
if request.messages[-1].role != "user":
raise HTTPException(status_code=400, detail="Invalid request")
query = request.messages[-1].content
if request.max_length is not None:
max_length = request.max_length
else:
max_length = 1024
if request.temperature is not None:
temperature = request.temperature
else:
temperature = 0.1
if request.top_p is not None:
top_p = request.top_p
else:
top_p = 0.8
prev_messages = request.messages[:-1]
# print(prev_messages)
if len(prev_messages) > 0 and prev_messages[0].role == "system":
query = prev_messages.pop(0).content + query
history = []
if len(prev_messages) % 2 == 0:
for i in range(0, len(prev_messages), 2):
if prev_messages[i].role == "user" and prev_messages[i+1].role == "assistant":
history.append([prev_messages[i].content, prev_messages[i+1].content])
if request.stream:
generate = predict(id=id, query=query, history=history, max_length=max_length, top_p = top_p, temperature = temperature, model_id = request.model)
return EventSourceResponse(generate, media_type="text/event-stream")
response = model.response(query=query, history=history, max_length=max_length, top_p = top_p, temperature = temperature)
choice_data = ChatCompletionResponseChoice(
index=0,
message=ChatMessage(role="assistant", content=response),
finish_reason="stop"
)
prompt_tokens = len(model.tokenizer_encode_string(query))
completion_tokens = len(model.tokenizer_encode_string(response))
usage = Usage(
prompt_tokens = prompt_tokens,
completion_tokens = completion_tokens,
total_tokens = prompt_tokens+completion_tokens,
)
return ChatCompletionResponse(id=id ,model=request.model, choices=[choice_data], object="chat.completion", usage=usage)
def predict(id: str, query: str, history: List[List[str]], model_id: str, max_length: int, top_p: float, temperature: float):
global model
creat_time = int(time.time())
choice_data = ChatCompletionResponseStreamChoice(
index=0,
delta=DeltaMessage(role="assistant"),
finish_reason=None
)
chunk = ChatCompletionResponse(id=id, created=creat_time, model=model_id, choices=[choice_data], object="chat.completion.chunk")
#yield "{}".format(chunk.json(exclude_unset=True, ensure_ascii=False)) //pydantic从1.8.0开始不支持dumps_kwags参数,参考https://github.com/THUDM/ChatGLM2-6B/issues/308
yield json.dumps(chunk.model_dump(exclude_unset=True), ensure_ascii=False)
for new_response in model.stream_response(query=query, history=history, max_length=max_length, top_p = top_p, temperature = temperature):
choice_data = ChatCompletionResponseStreamChoice(
index=0,
delta=DeltaMessage(content=new_response),
finish_reason=None
)
chunk = ChatCompletionResponse(id=id, created=creat_time, model=model_id, choices=[choice_data], object="chat.completion.chunk")
#yield "{}".format(chunk.json(exclude_unset=True, ensure_ascii=False))
yield json.dumps(chunk.model_dump(exclude_unset=True), ensure_ascii=False)
choice_data = ChatCompletionResponseStreamChoice(
index=0,
delta=DeltaMessage(),
finish_reason="stop"
)
chunk = ChatCompletionResponse(id=id, created=creat_time, model=model_id, choices=[choice_data], object="chat.completion.chunk")
#yield "{}".format(chunk.json(exclude_unset=True, ensure_ascii=False))
yield json.dumps(chunk.model_dump(exclude_unset=True), ensure_ascii=False)
yield '[DONE]'
def args_parser():
    parser = argparse.ArgumentParser(description = 'chatglm3_openai_api_demo')
    parser.add_argument('-p', '--path', type = str, default = "/model", help = 'path to the converted model file')
    parser.add_argument('-g', '--gpus', type = str, default = "0", help = 'GPUs to run on, e.g. "0,1"')
    args = parser.parse_args()
    return args
if __name__ == "__main__":
args = args_parser()
global model_list
model_list = ["chatglm3-6b-fastllm"]
global device_map
device_map = ["cuda:"+num for num in args.gpus.split(',')]
llm.set_device_map(device_map)
model = llm.model(args.path)
uvicorn.run(app, host='127.0.0.1', port=8100)
import openai
import time
import threading
import queue
from concurrent.futures import ThreadPoolExecutor, as_completed
def chat_completion_worker(model, messages, temperature, max_length, stream, index):
openai.api_base = "http://127.0.0.1:8100/v1"
openai.api_key = "none"
output_tokens = 0
ret = ""
t0 = time.time()
result = openai.ChatCompletion.create(model=model,messages=messages, temperature=temperature, max_length=max_length, stream=stream)
for chunk in result:
# print(chunk)
output_tokens += 1
if hasattr(chunk.choices[0].delta, "content"):
if (index == 0):
print(chunk.choices[0].delta.content, end="", flush=True)
ret += chunk.choices[0].delta.content
t1 = time.time()
# print("\ntoken/s: {:.2f}, output_tokens: {}".format(output_tokens/(t1-t0),output_tokens))
result = output_tokens, ret, output_tokens/(t1-t0)
return result
if __name__ == "__main__":
prompt = "满江红全文"
concurrencys = [1]
temperature = 0.1
max_length = 4096
stream = True
prompts = [prompt]
model="chatglm3-6b-fastllm"
messages=[{"role": "user", "content": "你好"}]
pool = ThreadPoolExecutor(max_workers=32)
for i in range(len(concurrencys)):
cur_prompts = prompts * concurrencys[i]
token_count = 0
threads = []
t0 = time.time()
for index, prompt in enumerate(cur_prompts):
messages[0]["content"] = prompt
            t = pool.submit(chat_completion_worker, model, messages, temperature, max_length, stream, index)
t.index = index
threads.append(t)
for future in as_completed(threads):
result = future.result()
print(future.index)
print(result)
print("\n")
token_count += result[0]
t1 = time.time()
print("\n---------------------------------------------\n")
print("\nconcurrency: {}".format(concurrencys[i]))
print("\ntotal use: {:.2f}".format(t1-t0))
print("\ntoken/s: {:.2f}, token_count: {}".format(token_count/(t1-t0),token_count))
print("\n---------------------------------------------\n")
uvicorn==0.23.2
pydantic==2.5.1
fastapi==0.103.1
sse_starlette
import sys
from transformers import AutoTokenizer, AutoModel
from fastllm_pytools import torch2flm
if __name__ == "__main__":
model_path = "THUDM/chatglm3-6b"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
model = model.eval()
dtype = sys.argv[2] if len(sys.argv) >= 3 else "float16"
    exportPath = sys.argv[1] if len(sys.argv) >= 2 else "chatglm3-6b-" + dtype + ".flm"
torch2flm.tofile(exportPath, model, tokenizer, dtype = dtype)
# coding=utf-8
import argparse
from fastllm_pytools import llm
import time
def args_parser():
parser = argparse.ArgumentParser(description = 'fastllm_chat_demo')
    parser.add_argument('-p', '--path', type = str, required = True, help = 'path to the model file')
args = parser.parse_args()
return args
if __name__ == "__main__":
args = args_parser()
model = llm.model(args.path)
history = []
print("输入内容即可进行对话,clear 清空对话历史,stop 终止程序")
while True:
query = input("\n用户:")
if query.strip() == "stop":
break
if query.strip() == "clear":
history = []
print("输入内容即可进行对话,clear 清空对话历史,stop 终止程序")
continue
print("AI:", end = "")
curResponse = ""
prompt = model.get_prompt(query, history)
tokens = model.tokenizer_encode_string(prompt)
token_input_count = len(tokens)
print("token_input_count", token_input_count)
token_count = 0
t0 = time.time()
for response in model.stream_response(query, history = history, do_sample = True, top_p = 0.8, top_k = 1, temperature = 1.0, repeat_penalty = 1.01):
curResponse += response
print(response, flush = True, end = "")
token_count += 1
t1 = time.time()
word_len = len(curResponse)
print("\ntoken/s: {:.2f}, character/s: {:.2f}".format(token_count/(t1-t0), word_len/(t1-t0)))
history.append((query, curResponse))
# Unique model identifier
modelCode = 514
# Model name
modelName=chatglm3_fastllm
# Model description
modelDescription=ChatGLM3 is a new-generation conversational pre-trained model jointly released by Zhipu AI and the KEG Lab of Tsinghua University
# Application scenarios
appScenario=Inference,Conversational QA,Healthcare,Scientific research,Finance,Education
# Framework type
frameType=fastllm
__all__ = ["llm"]
from fastllm_pytools import llm;
import torch;
import ctypes;
import numpy as np;
fastllm_data_type_dict = {
"int4": 8,
"int8": 3,
"float16": 7
}
fastllm_weight_type_dict = {
"linear": 1,
"embedding": 2,
"QuantizedLinear": 111
}
def create(model,
tokenizer = None,
pre_prompt = None,
user_role = None,
bot_role = None,
history_sep = None,
dtype = "float16"):
if (dtype not in fastllm_data_type_dict):
print("dtype should in ", list(fastllm_data_type_dict.keys()));
exit(0);
# 0.1 model info
# if model.config.model_type == "chatglm" and model.config.transformers_version == "4.30.2":
# model.config.model_type = "chatglm3"
# print("model.config.model_type: chatglm3!")
modelInfo = model.config.__dict__
if model.generation_config is not None:
modelInfo.update(model.generation_config.__dict__)
if (pre_prompt):
modelInfo["pre_prompt"] = pre_prompt;
if (user_role):
modelInfo["user_role"] = user_role;
if (bot_role):
modelInfo["bot_role"] = bot_role;
if (history_sep):
modelInfo["history_sep"] = history_sep;
if (modelInfo["model_type"] == "baichuan" and hasattr(model, "model") and hasattr(model.model, "get_alibi_mask")):
        # Baichuan 2
modelInfo["use_alibi"] = "1";
modelInfo["pre_prompt"] = "";
modelInfo["user_role"] = ("<FLM_FIX_TOKEN_" + str(model.generation_config.user_token_id) + "> ") if hasattr(model.generation_config, "user_token_id") else "";
modelInfo["bot_role"] = ("<FLM_FIX_TOKEN_" + str(model.generation_config.assistant_token_id) + ">") if hasattr(model.generation_config, "assistant_token_id") else "";
modelInfo["history_sep"] = "";
if (modelInfo["model_type"] == "qwen"):
if modelInfo["chat_format"] == "chatml":
modelInfo["im_end_id"] = tokenizer.im_end_id
modelInfo["im_start_id"] = tokenizer.im_start_id
if (modelInfo["model_type"] == "chatglm" and hasattr(tokenizer, "build_chat_input")):
# chatglm3
modelInfo["pre_prompt"] = "";
modelInfo["user_role"] = ("<FLM_FIX_TOKEN_" + str(tokenizer.get_command("<|user|>")) + ">\n");
modelInfo["bot_role"] = ("<FLM_FIX_TOKEN_" + str(tokenizer.get_command("<|assistant|>")) + ">");
modelInfo["history_sep"] = "";
weight_type_dict = {};
module_dict = {};
weight_bits = {};
for key, m in model.named_modules():
if (str(type(m)).find("QuantizedLinear") != -1):
weight_type_dict[key + ".weight"] = "QuantizedLinear";
weight_bits[key + ".weight"] = m.weight_bit_width;
if (isinstance(m, torch.nn.Linear)):
weight_type_dict[key + ".weight"] = "linear";
module_dict[key + ".weight"] = m;
if (isinstance(m, torch.nn.Embedding)):
weight_type_dict[key] = "embedding";
peft_config = {}
active_adapter = ""
if hasattr(model, "peft_config"):
peft_config = model.peft_config
if hasattr(model, "active_adapter") and isinstance(model.active_adapter, str):
        # in transformers >= 4.33.0, active_adapter is a function on the model, so ignore it for now
active_adapter = model.active_adapter
model = model.cpu();
dict = model.state_dict();
model_type = model.config.__dict__["model_type"];
model = llm.fastllm_lib.create_empty_llm_model(model_type.encode());
for it in modelInfo.keys():
llm.fastllm_lib.add_dict_llm_model(model, str(it).encode(), str(modelInfo[it]).encode());
for adapter_name in peft_config.keys():
adapter_dict = peft_config[adapter_name].__dict__
for it in adapter_dict.keys():
llm.fastllm_lib.add_adapter_dict_llm_model(model, str(adapter_name).encode(), str(it).encode(), str(adapter_dict[it]).encode())
if len(active_adapter) != 0:
llm.fastllm_lib.set_adapter(model, str(active_adapter).encode())
# 1. vocab
if (tokenizer):
if (hasattr(tokenizer, "tokenizer")):
if modelInfo["model_type"] == "qwen":
pass
else:
tokenizer = tokenizer.tokenizer;
if (hasattr(tokenizer, "sp_model")):
piece_size = tokenizer.sp_model.piece_size();
for i in range(piece_size):
llm.fastllm_lib.add_tokenizer_word_llm_model(model, tokenizer.sp_model.id_to_piece(i).encode(),
i, ctypes.c_float(tokenizer.sp_model.get_score(i)));
else:
vocab = tokenizer.get_vocab();
for v in vocab.keys():
if (modelInfo["model_type"] == "moss"):
vv = [(ord(c) if c not in tokenizer.byte_decoder else tokenizer.byte_decoder[c]) for c in v];
llm.fastllm_lib.add_tokenizer_word_llm_model(model, vv, vocab[v], ctypes.c_float(1.0));
elif (modelInfo["model_type"] == "qwen"):
llm.fastllm_lib.add_tokenizer_word_llm_model(model, v, vocab[v], ctypes.c_float(1.0));
else:
llm.fastllm_lib.add_tokenizer_word_llm_model(model, v.encode(), vocab[v], ctypes.c_float(1.0));
tot = 0;
for key in dict:
ori_data_type = 0;
ori_np_data_type = np.float32;
cur_weight_type = 0;
if (key in weight_type_dict and weight_type_dict[key] in fastllm_weight_type_dict):
cur_weight_type = fastllm_weight_type_dict[weight_type_dict[key]];
to_data_type = 0;
if (cur_weight_type == 1):
to_data_type = fastllm_data_type_dict[dtype];
if (to_data_type == 7):
ori_data_type = 7;
ori_np_data_type = np.float16;
elif (cur_weight_type == 2):
# TODO bfloat
to_data_type = 0;
weight_name = key
if peft_config is not None:
weight_name = weight_name.replace('base_model.model.', '')
if (cur_weight_type == 111):
llm.fastllm_lib.add_qlinear_weight_llm_model(model, weight_name.encode(),
len(dict[key].shape),
(ctypes.c_int * len(dict[key].shape))(*list(dict[key].shape)),
weight_bits[key],
dict[key + "_scale"].numpy().astype(np.float32).ctypes.data_as(ctypes.c_void_p),
dict[key].numpy().ctypes.data_as(ctypes.c_void_p));
else:
llm.fastllm_lib.add_weight_llm_model(model, weight_name.encode(),
len(dict[key].shape),
(ctypes.c_int * len(dict[key].shape))(*list(dict[key].shape)),
to_data_type, cur_weight_type, ori_data_type,
dict[key].numpy().astype(ori_np_data_type).ctypes.data_as(ctypes.c_void_p));
tot += 1;
print("convert (", tot, "/", len(dict), end = " )\r");
print("");
llm.fastllm_lib.init_params_llm_model(model);
llm.fastllm_lib.warmup_llm_model(model);
ret = llm.model("", id = model);
return ret;
import ctypes;
import math
import os;
import threading
from typing import Optional, Tuple, Union, List, Callable, Dict, Any;
from copy import deepcopy
import json
import platform
if platform.system() == 'Windows':
fastllm_lib = ctypes.CDLL(os.path.join(os.path.split(os.path.realpath(__file__))[0], "fastllm_tools.dll"), winmode=0)
elif platform.system() == 'Darwin':
fastllm_lib = ctypes.cdll.LoadLibrary(os.path.join(os.path.split(os.path.realpath(__file__))[0], "libfastllm_tools.dylib"))
else:
fastllm_lib = ctypes.cdll.LoadLibrary(os.path.join(os.path.split(os.path.realpath(__file__))[0], "libfastllm_tools.so"))
fastllm_lib.create_llm_model.argtypes = [ctypes.c_char_p]
fastllm_lib.create_llm_model.restype = ctypes.c_int
fastllm_lib.token_decode.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.c_int, ctypes.c_char_p]
fastllm_lib.token_decode.restype = ctypes.c_int
fastllm_lib.token_encode_string.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_int, ctypes.POINTER(ctypes.c_int)]
fastllm_lib.token_encode_string.restype = ctypes.c_int
fastllm_lib.launch_response_llm_model.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.c_void_p,
ctypes.c_int, ctypes.c_bool, ctypes.c_float, ctypes.c_int,
ctypes.c_float, ctypes.c_float, ctypes.c_bool,
ctypes.c_int, ctypes.POINTER(ctypes.c_int)]
fastllm_lib.launch_response_llm_model.restype = ctypes.c_int
fastllm_lib.fetch_response_llm_model.argtypes = [ctypes.c_int, ctypes.c_int]
fastllm_lib.fetch_response_llm_model.restype = ctypes.c_int
fastllm_lib.fetch_response_logits_llm_model.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.POINTER(ctypes.c_float)]
fastllm_lib.fetch_response_logits_llm_model.restype = ctypes.c_int
fastllm_lib.response_str_llm_model.argtypes = [ctypes.c_int, ctypes.c_char_p,
ctypes.c_int, ctypes.c_bool, ctypes.c_float, ctypes.c_int,
ctypes.c_float, ctypes.c_float, ctypes.c_bool]
# fastllm_lib.response_str_llm_model.restype = ctypes.c_char_p
fastllm_lib.response_str_llm_model.restype = ctypes.POINTER(ctypes.c_char)
fastllm_lib.launch_response_str_llm_model.argtype = [ctypes.c_int, ctypes.c_char_p,
ctypes.c_int, ctypes.c_bool, ctypes.c_float, ctypes.c_int,
ctypes.c_float, ctypes.c_float, ctypes.c_bool,
ctypes.c_int, ctypes.POINTER(ctypes.c_int)]
fastllm_lib.launch_response_str_llm_model.restype = ctypes.c_int
fastllm_lib.fetch_response_str_llm_model.argtypes = [ctypes.c_int, ctypes.c_int]
# fastllm_lib.fetch_response_str_llm_model.restype = ctypes.c_char_p
fastllm_lib.fetch_response_str_llm_model.restype = ctypes.POINTER(ctypes.c_char)
fastllm_lib.make_history_llm_model.argtype = [ctypes.c_int, ctypes.c_char_p, ctypes.c_int, ctypes.c_char_p, ctypes.c_char_p]
# fastllm_lib.make_history_llm_model.restype = ctypes.c_char_p
fastllm_lib.make_history_llm_model.restype = ctypes.POINTER(ctypes.c_char)
fastllm_lib.make_input_llm_model.argtype = [ctypes.c_int, ctypes.c_char_p, ctypes.c_int, ctypes.c_char_p]
# fastllm_lib.make_input_llm_model.restype = ctypes.c_char_p
fastllm_lib.make_input_llm_model.restype = ctypes.POINTER(ctypes.c_char)
fastllm_lib.add_tokenizer_word_llm_model.argtype = [ctypes.c_int, ctypes.c_char_p, ctypes.c_float, ctypes.c_int]
fastllm_lib.set_device_map.argtype = [ctypes.c_int, ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]
fastllm_lib.get_llm_model_type.argtype = [ctypes.c_int]
fastllm_lib.get_llm_model_type.restype = ctypes.POINTER(ctypes.c_char)
fastllm_lib.response_batch_str_llm_model.argtypes = [ctypes.c_int, ctypes.POINTER(ctypes.c_char_p), ctypes.c_int,
ctypes.c_int, ctypes.c_bool, ctypes.c_float, ctypes.c_int,
ctypes.c_float, ctypes.c_float, ctypes.c_bool]
fastllm_lib.response_batch_str_llm_model.restype = ctypes.POINTER(ctypes.c_char_p)
fastllm_lib.response_batch_tokens_llm_model.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.POINTER(ctypes.c_int), ctypes.POINTER(ctypes.c_int),
ctypes.c_int, ctypes.c_bool, ctypes.c_float, ctypes.c_int,
ctypes.c_float, ctypes.c_float, ctypes.c_bool]
fastllm_lib.response_batch_tokens_llm_model.restype = ctypes.POINTER(ctypes.c_char_p)
fastllm_lib.freeChars.argtype = [ctypes.POINTER(ctypes.c_char)]
# fastllm_lib.freeChars.restype = ctypes.c_char_p
fastllm_lib.freeCharArray.argtype = [ctypes.POINTER(ctypes.c_char_p)]
def set_cpu_threads(threads: int):
fastllm_lib.set_cpu_threads(threads);
def get_cpu_threads() -> int:
return fastllm_lib.get_cpu_threads();
def print_ins_info():
fastllm_lib.print_cpu_ins();
def set_cpu_kvcache(cpu_kvcache):
fastllm_lib.set_kvcache_in_cpu(ctypes.c_bool(cpu_kvcache));
def get_cpu_kvcache():
return fastllm_lib.get_kvcache_in_cpu();
def set_cpu_low_mem(low_mem):
fastllm_lib.set_cpu_low_mem(ctypes.c_bool(low_mem));
def get_cpu_low_mem():
return fastllm_lib.get_cpu_low_mem();
def set_device_map(device_map):
devices = [];
values = [];
if (isinstance(device_map, str)):
devices.append(device_map);
values.append(1);
elif (isinstance(device_map, list)):
devices = [str(x) for x in device_map];
values = [1 for x in device_map];
elif (isinstance(device_map, dict)):
devices = [str(x) for x in device_map.keys()];
values = [int(device_map[x]) for x in device_map.keys()];
else:
print("set_device_map error.");
return;
device_str = ''.join(devices);
device_len = [len(x) for x in devices];
fastllm_lib.set_device_map(len(device_len),
(ctypes.c_int * len(device_len))(*device_len),
device_str.encode(),
(ctypes.c_int * len(values))(*values));
def from_hf(model,
tokenizer = None,
dtype = "float16"):
from fastllm_pytools import hf_model;
return hf_model.create(model, tokenizer, dtype = dtype);
class model:
def __init__ (self, path : str,
id : int = -99999):
if (id != -99999):
self.model = id;
else:
self.model = fastllm_lib.create_llm_model(path.encode());
self.direct_query = False;
        # Thread-local storage used as an object pool, to avoid repeatedly allocating and freeing buffers
self.thread_local_obj = threading.local()
self.thread_local_obj.tokenizer_encode_string__output_buffer = None
self.thread_local_obj.tokenizer_decode_token__output_buffer = None
        # Static cache of tokenizer_decode_token results, built manually on demand.
        # Since the number of tokens is limited and not too large, caching the results is a good way to reduce calls.
        # The cache is not built automatically, to avoid locking the dict in multi-threaded use and to leave the choice to each scenario.
self.tokenizer_decode_token_cache = None
model_type_ptr = fastllm_lib.get_llm_model_type(self.model)
self.model_type = ctypes.string_at(model_type_ptr).decode()
fastllm_lib.freeChars(model_type_ptr)
# print("model_type:", self.model_type)
def get_prompt(self,
query: str,
history: List[Tuple[str, str]] = None) -> str:
if (not(history)):
history = [];
prompt = "";
for i, (old_query, response) in enumerate(history):
history_ptr = fastllm_lib.make_history_llm_model(self.model, prompt.encode(), i, old_query.encode(), response.encode())
prompt = ctypes.string_at(history_ptr).decode()
fastllm_lib.freeChars(history_ptr)
input_ptr = fastllm_lib.make_input_llm_model(self.model, prompt.encode(), len(history), query.encode())
prompt = ctypes.string_at(input_ptr).decode()
fastllm_lib.freeChars(input_ptr)
return prompt;
def save(self, path : str):
fastllm_lib.save_llm_model(self.model, path.encode());
def eval(self):
pass;
def build_tokenizer_decode_token_cache(self):
if self.tokenizer_decode_token_cache is not None:
return
cache_dict = dict()
vocab_size = fastllm_lib.get_tokenizer_vocab_size(self.model)
for token_id in range(vocab_size):
cache_dict[token_id] = self.tokenizer_decode_token(token_id)
self.tokenizer_decode_token_cache = cache_dict
def tokenizer_encode_string(self, content: str) -> List[int]:
output_buffer_init_len = 1024
if self.thread_local_obj.tokenizer_encode_string__output_buffer is None:
self.thread_local_obj.tokenizer_encode_string__output_buffer = (ctypes.c_int * output_buffer_init_len)()
buffer = self.thread_local_obj.tokenizer_encode_string__output_buffer
buffer_len = len(buffer)
result_len = fastllm_lib.token_encode_string(self.model, content.encode(), buffer_len, buffer)
if result_len > buffer_len:
if result_len > 10240:
                # The input is too long; use a one-off buffer
temp_buffer = (ctypes.c_int * result_len)()
ret = fastllm_lib.token_encode_string(self.model, content.encode(), result_len, temp_buffer)
return [i for i in temp_buffer]
else:
                # grow the thread-local buffer
new_buffer_len = round(math.ceil(result_len / 1024.0)) * 1024
buffer = (ctypes.c_int * new_buffer_len)()
self.thread_local_obj.tokenizer_encode_string__output_buffer = buffer
result_len = fastllm_lib.token_encode_string(self.model, content.encode(), new_buffer_len, buffer)
return [buffer[i] for i in range(result_len)]
def tokenizer_decode_token(self, token_id: int) -> bytes:
if self.tokenizer_decode_token_cache is not None:
cache_result = self.tokenizer_decode_token_cache.get(token_id)
if cache_result is not None:
return cache_result
output_buffer_init_len = 256
if self.thread_local_obj.tokenizer_decode_token__output_buffer is None:
self.thread_local_obj.tokenizer_decode_token__output_buffer = ctypes.create_string_buffer(output_buffer_init_len)
buffer = self.thread_local_obj.tokenizer_decode_token__output_buffer
ret = fastllm_lib.token_decode(self.model, token_id, len(buffer), buffer)
if ret > 0:
            # The buffer is too small; grow it and decode again
new_buffer_len = round(math.ceil(ret / 16.0)) * 16
buffer = ctypes.create_string_buffer(new_buffer_len)
self.thread_local_obj.tokenizer_decode_token__output_buffer = buffer
ret = fastllm_lib.token_decode(self.model, token_id, len(buffer), buffer)
assert ret == 0
buffer_bytes = buffer.raw
result_len = len(buffer_bytes)
for i in range(len(buffer_bytes)):
if buffer_bytes[i] == 0:
result_len = i
break
return buffer_bytes[:result_len]
def stop_token_ctypes(self, stop_token_ids):
if stop_token_ids is None:
return 0, None
else:
return ctypes.c_int(len(stop_token_ids)), (ctypes.c_int * len(stop_token_ids))(*stop_token_ids)
def response_logits(self,
query: str,
history: List[Tuple[str, str]] = None,
tokenizer = None,
stop_token_ids: List[int] = None,
) -> str:
prompt = query if self.direct_query else self.get_prompt(query, history);
stop_token_len, stop_token_list = self.stop_token_ctypes(stop_token_ids)
if (tokenizer == None):
handle = fastllm_lib.launch_response_str_llm_model(self.model, prompt.encode(),
ctypes.c_int(1), ctypes.c_bool(False), ctypes.c_float(1), ctypes.c_int(1),
ctypes.c_float(1), ctypes.c_float(1), ctypes.c_bool(True),
stop_token_len, stop_token_list);
else:
input = tokenizer.encode(prompt);
handle = fastllm_lib.launch_response_llm_model(self.model, len(input), (ctypes.c_int * len(input))(*input),
1, False, 1, 1, 1, 1, True, stop_token_len, stop_token_list);
vocab_size = fastllm_lib.get_tokenizer_vocab_size(self.model);
logits = list(range(vocab_size))
array = (ctypes.c_float * (vocab_size * 4))(*logits);
ret = fastllm_lib.fetch_response_logits_llm_model(self.model, handle, array);
out = list(array)[:vocab_size];
while (ret != -1):
ret = fastllm_lib.fetch_response_logits_llm_model(self.model, handle, array);
return out;
def response(self,
query: str,
history: List[Tuple[str, str]] = None,
max_length: int = 8192, do_sample = True, top_p = 0.8, top_k = 1, temperature = 1.0, repeat_penalty = 1.01,
stop_token_ids: List[int] = None) -> str:
ret = "";
for i in self.stream_response(query = query,
history = history,
max_length = max_length,
do_sample = do_sample,
top_p = top_p, top_k = top_k,
temperature = temperature,
repeat_penalty = repeat_penalty,
one_by_one = True,
stop_token_ids = stop_token_ids):
ret += i;
return ret;
def stream_response(self,
query: str,
history: List[Tuple[str, str]] = None,
max_length: int = 8192, do_sample = True, top_p = 0.8, top_k = 1, temperature = 1.0, repeat_penalty = 1.01,
one_by_one = True, stop_token_ids: List[int] = None):
prompt = query if self.direct_query else self.get_prompt(query, history);
stop_token_len, stop_token_list = self.stop_token_ctypes(stop_token_ids);
handle = fastllm_lib.launch_response_str_llm_model(self.model, prompt.encode(),
ctypes.c_int(max_length), ctypes.c_bool(do_sample), ctypes.c_float(top_p), ctypes.c_int(top_k),
ctypes.c_float(temperature), ctypes.c_float(repeat_penalty), ctypes.c_bool(False),
stop_token_len, stop_token_list);
res = "";
ret = b'';
fail_cnt = 0;
while True:
# ret += fastllm_lib.fetch_response_str_llm_model(self.model, handle);
ret_chararry = fastllm_lib.fetch_response_str_llm_model(self.model, handle);
ret += ctypes.string_at(ret_chararry)
fastllm_lib.freeChars(ret_chararry)
cur = "";
try:
cur = ret.decode()
ret = b'';
except:
fail_cnt += 1;
if (fail_cnt == 20):
break;
else:
continue;
fail_cnt = 0;
if (cur == "<flmeos>"):
break;
if one_by_one:
yield cur;
else:
res += cur;
yield res;
def stream_response_raw(self,
input_tokens: List[int],
max_length: int = 8192, do_sample = True, top_p = 0.8, top_k = 1, temperature = 1.0, repeat_penalty = 1.01,
one_by_one = True,
stop_token_ids: List[int] = None
):
stop_token_len, stop_token_list = self.stop_token_ctypes(stop_token_ids)
handle = fastllm_lib.launch_response_llm_model(self.model, len(input_tokens),
(ctypes.c_int * len(input_tokens))(*input_tokens),
ctypes.c_int(max_length), ctypes.c_bool(do_sample), ctypes.c_float(top_p), ctypes.c_int(top_k),
ctypes.c_float(temperature), ctypes.c_float(repeat_penalty), ctypes.c_bool(False),
stop_token_len, stop_token_list)
        # A long-tail character may need several tokens to be produced, so raw bytes are returned and the
        # string-decode strategy is left to the caller; this also makes it easy to count output tokens and
        # to control decoding when a UTF-8 sequence is incomplete.
total_bytes = b''
while True:
cur_token = fastllm_lib.fetch_response_llm_model(self.model, handle)
if cur_token == -1:
break
cur_bytes = self.tokenizer_decode_token(cur_token)
if one_by_one:
yield cur_bytes
else:
total_bytes += cur_bytes
yield total_bytes
def chat(self, tokenizer, query: str, history: List[Tuple[str, str]] = None, max_length: int = 8192,
do_sample = True, top_p = 0.8, top_k = 1, temperature = 1.0, repeat_penalty = 1.01, stop_token_ids: List[int] = None, **kwargs):
if self.model_type != "chatglm3":
if (not(history)):
history = [];
prompt = query if self.direct_query else self.get_prompt(query, history);
input = tokenizer.encode(prompt);
stop_token_len, stop_token_list = self.stop_token_ctypes(stop_token_ids)
handle = fastllm_lib.launch_response_llm_model(self.model, len(input), (ctypes.c_int * len(input))(*input),
max_length, do_sample, top_p, top_k, temperature, repeat_penalty,
False, stop_token_len, stop_token_list);
result = [];
while True:
cur = fastllm_lib.fetch_response_llm_model(self.model, handle);
if (cur == -1):
break;
result.append(cur);
response = tokenizer.decode(result);
history = history + [(query, response)];
return response, history;
else:
if history is None:
history = []
role = "user"
input = self.build_chatglm3_input(tokenizer, query, history=history, role=role)
history.append({"role": role, "content": query})
stop_token_len, stop_token_list = self.stop_token_ctypes(stop_token_ids)
handle = fastllm_lib.launch_response_llm_model(self.model, len(input), (ctypes.c_int * len(input))(*input),
max_length, do_sample, top_p, top_k, temperature, repeat_penalty,
False, stop_token_len, stop_token_list);
tokens = [];
while True:
cur = fastllm_lib.fetch_response_llm_model(self.model, handle);
if (cur == -1):
break;
tokens.append(cur);
response = tokenizer.decode(tokens);
if response and response[-1] != "�":
response, new_history = self.process_chatglm3_response(response, history)
return response, new_history
def stream_chat(self, tokenizer, query: str, history: List[Tuple[str, str]] = None, past_key_values = None,
max_length: int = 8192, do_sample = True, top_p = 0.8, top_k = 1, temperature = 1.0, repeat_penalty = 1.01,
return_past_key_values = False, stop_token_ids: List[int] = None, **kwargs) -> str:
if self.model_type != "chatglm3":
if (not(history)):
history = [];
prompt = query if self.direct_query else self.get_prompt(query, history);
input = tokenizer.encode(prompt);
stop_token_len, stop_token_list = self.stop_token_ctypes(stop_token_ids)
handle = fastllm_lib.launch_response_llm_model(self.model, len(input), (ctypes.c_int * len(input))(*input),
max_length, do_sample, top_p, top_k, temperature, repeat_penalty,
False, stop_token_len, stop_token_list);
tokens = [];
while True:
cur = fastllm_lib.fetch_response_llm_model(self.model, handle);
if (cur == -1):
break;
tokens.append(cur);
response = tokenizer.decode(tokens);
new_history = history + [(query, response)];
if return_past_key_values:
yield response, new_history, None;
else:
yield response, new_history;
else:
if history is None:
history = []
role = "user"
input = self.build_chatglm3_input(tokenizer, query, history=history, role=role)
history.append({"role": role, "content": query})
stop_token_len, stop_token_list = self.stop_token_ctypes(stop_token_ids)
handle = fastllm_lib.launch_response_llm_model(self.model, len(input), (ctypes.c_int * len(input))(*input),
max_length, do_sample, top_p, top_k, temperature, repeat_penalty,
False, stop_token_len, stop_token_list);
tokens = [];
while True:
cur = fastllm_lib.fetch_response_llm_model(self.model, handle);
if (cur == -1):
break;
tokens.append(cur);
response = tokenizer.decode(tokens);
if response and response[-1] != "�":
response, new_history = self.process_chatglm3_response(response, history)
if return_past_key_values:
yield response, new_history, past_key_values
else:
yield response, new_history
def set_adapter(self, name: str):
fastllm_lib.set_adapter(self.model, str(name).encode())
def disable_adapter(self):
fastllm_lib.disable_adapter(self.model)
def process_chatglm3_response(self, output, history):
content = ""
history = deepcopy(history)
for response in output.split("<|assistant|>"):
metadata, content = response.split("\n", maxsplit=1)
if not metadata.strip():
content = content.strip()
history.append({"role": "assistant", "metadata": metadata, "content": content})
content = content.replace("[[训练时间]]", "2023年")
else:
history.append({"role": "assistant", "metadata": metadata, "content": content})
if history[0]["role"] == "system" and "tools" in history[0]:
content = "\n".join(content.split("\n")[1:-1])
def tool_call(**kwargs):
return kwargs
parameters = eval(content)
content = {"name": metadata.strip(), "parameters": parameters}
else:
content = {"name": metadata.strip(), "content": content}
return content, history
def build_chatglm3_input(self, tokenizer, query, history=None, role="user"):
if history is None:
history = []
input_ids = []
for item in history:
content = item["content"]
if item["role"] == "system" and "tools" in item:
content = content + "\n" + json.dumps(item["tools"], indent=4, ensure_ascii=False)
input_ids.extend(tokenizer.build_single_message(item["role"], item.get("metadata", ""), content))
input_ids.extend(tokenizer.build_single_message(role, "", query))
input_ids.extend([tokenizer.get_command("<|assistant|>")])
return input_ids
def response_batch_raw(self, querys: List[str],
historys: List[List[Tuple[str, str]]] = None,
max_length: int = 1024, do_sample = True, top_p = 0.8, top_k = 1, temperature = 1.0, repeat_penalty = 1.01,
**kwargs) -> List[str]:
query_size = len(querys)
if (not(historys)):
historys = [[] for _ in range(query_size)]
inputs = (ctypes.c_char_p * query_size)()
for i, query in enumerate(querys):
prompt = query if self.direct_query else self.get_prompt(query, historys[i])
inputs[i] = ctypes.c_char_p(prompt.encode())
outputs = fastllm_lib.response_batch_str_llm_model(self.model, inputs, query_size,
max_length, do_sample, top_p, top_k, temperature, repeat_penalty, False)
responses = []
for i in range(query_size):
response = ctypes.string_at(outputs[i]).decode()
responses.append(response)
historys[i] = historys[i] + [(querys[i], response)]
fastllm_lib.freeCharArray(outputs, query_size)
return responses, historys
def chat_batch_raw(self, tokenizer, querys: List[str], historys: List[List[Tuple[str, str]]] = None, max_length: int = 1024,
do_sample = True, top_p = 0.8, top_k = 1, temperature = 1.0, repeat_penalty = 1.01, **kwargs):
query_size = len(querys)
if (not(historys)):
historys = [[] for _ in range(query_size)]
inputs = []
inputs_len = []
for i, query in enumerate(querys):
prompt = query if self.direct_query else self.get_prompt(query, historys[i])
input = tokenizer.encode(prompt);
inputs.extend(input)
inputs_len.append(len(input))
outputs = fastllm_lib.response_batch_tokens_llm_model(self.model, query_size,
(ctypes.c_int * len(inputs_len))(*inputs_len),
(ctypes.c_int * len(inputs))(*inputs),
max_length, do_sample, top_p, top_k, temperature, repeat_penalty,
False)
responses = []
for i in range(query_size):
response = ctypes.string_at(outputs[i]).decode()
responses.append(response)
historys[i] = historys[i] + [(querys[i], response)]
fastllm_lib.freeCharArray(outputs, query_size)
return responses, historys
def response_batch(self, querys: List[str],
historys: List[List[Tuple[str, str]]] = None,
max_length: int = 1024, do_sample = True, top_p = 0.8, top_k = 1, temperature = 1.0, repeat_penalty = 1.01,
stop_token_ids: List[int] = None, **kwargs) -> List[str]:
query_size = len(querys)
if (not(historys)):
historys = [[] for _ in range(query_size)]
handles = []
for i, query in enumerate(querys):
prompt = query if self.direct_query else self.get_prompt(query, historys[i])
stop_token_len, stop_token_list = self.stop_token_ctypes(stop_token_ids);
handle = fastllm_lib.launch_response_str_llm_model(self.model, prompt.encode(),
ctypes.c_int(max_length), ctypes.c_bool(do_sample), ctypes.c_float(top_p), ctypes.c_int(top_k),
ctypes.c_float(temperature), ctypes.c_float(repeat_penalty), ctypes.c_bool(False),
stop_token_len, stop_token_list)
handles.append(handle)
responses = []
for i, handle in enumerate(handles):
res = ""
ret = b''
fail_cnt = 0
while True:
# ret += fastllm_lib.fetch_response_str_llm_model(self.model, handle);
ret_chararry = fastllm_lib.fetch_response_str_llm_model(self.model, handle);
ret += ctypes.string_at(ret_chararry)
fastllm_lib.freeChars(ret_chararry)
cur = ""
try:
cur = ret.decode()
ret = b''
except:
fail_cnt += 1
if (fail_cnt == 20):
break
else:
continue
fail_cnt = 0
if (cur == "<flmeos>"):
break;
res += cur
responses.append(res)
historys[i] = historys[i] + [(querys[i], res)]
return responses, historys
def chat_batch(self, tokenizer, querys: List[str], historys: List[List[Tuple[str, str]]] = None, max_length: int = 1024,
do_sample = True, top_p = 0.8, top_k = 1, temperature = 1.0, repeat_penalty = 1.01, stop_token_ids: List[int] = None, **kwargs):
query_size = len(querys)
if (not(historys)):
historys = [[] for _ in range(query_size)]
handles = []
for i, query in enumerate(querys):
prompt = query if self.direct_query else self.get_prompt(query, historys[i])
input = tokenizer.encode(prompt);
stop_token_len, stop_token_list = self.stop_token_ctypes(stop_token_ids);
handle = fastllm_lib.launch_response_llm_model(self.model, len(input), (ctypes.c_int * len(input))(*input),
max_length, do_sample, top_p, top_k, temperature, repeat_penalty,
False, stop_token_len, stop_token_list);
handles.append(handle)
responses = []
for i, handle in enumerate(handles):
result = [];
while True:
cur = fastllm_lib.fetch_response_llm_model(self.model, handle);
if (cur == -1):
break;
result.append(cur);
response = tokenizer.decode(result);
responses.append(response)
historys[i] = historys[i] + [(querys[i], response)]
return responses, historys
def release_memory(self):
fastllm_lib.release_memory(self.model)
import struct
import numpy as np
import torch
def writeString(fo, s):
fo.write(struct.pack('i', len(s)))
fo.write(s.encode())
def writeKeyValue(fo, key, value):
writeString(fo, key)
writeString(fo, value)
fastllm_data_type_dict = {
"int4": 8,
"int8": 3,
"float16": 7,
"float32": 0,
}
fastllm_weight_type_dict = {
"linear": 1,
"embedding": 2
}
v = np.random.randint(-127, 127, [10, 20]);
temp = v;
c_max = np.expand_dims(np.abs(v).max(axis = -1), -1)
c_scale = c_max / 127.0
v = (v / c_scale + 128.5).clip(1, 255).astype(np.uint8)
def write_int8(fo, v):
c_max = np.expand_dims(np.abs(v).max(axis = -1), -1).clip(0.1, 1e100)
c_scale = c_max / 127.0
v = (v / c_scale + 128.5).clip(1, 255).astype(np.uint8)
fo.write(struct.pack('i', 3))
fo.write(struct.pack('i', 0))
for i in range(c_max.shape[0]):
fo.write(struct.pack('f', -c_max[i][0]));
fo.write(struct.pack('f', c_max[i][0]));
fo.write(v.data)
def write_int4(fo, v):
# c_min = np.expand_dims(-np.abs(v).max(axis = -1), -1)
# c_max = np.expand_dims(np.abs(v).max(axis = -1), -1)
# c_scale = c_max / 7.0
# c_min = c_scale * -8.0
c_min = np.expand_dims(v.min(axis = -1), -1)
c_max = np.expand_dims(v.max(axis = -1), -1)
c_scale = (c_max - c_min) / 15.0
c_zero = np.round(0.0 - c_min / c_scale)
c_zero = c_zero.clip(0, 15)
c_min = -c_scale * c_zero
v = (v - c_min) / c_scale
v = (v + 0.5).astype(np.int8).clip(0, 15).astype(np.uint8)
v = v[:, 0::2] * 16 + v[:, 1::2]
fo.write(struct.pack('i', 8))
fo.write(struct.pack('i', 0))
for i in range(c_min.shape[0]):
fo.write(struct.pack('f', c_min[i][0]));
fo.write(struct.pack('f', c_max[i][0]));
fo.write(v.data)
def tofile(exportPath,
model,
tokenizer = None,
pre_prompt = None,
user_role = None,
bot_role = None,
history_sep = None,
dtype = "float16"):
if (dtype not in fastllm_data_type_dict):
print("dtype should in ", list(fastllm_data_type_dict.keys()))
exit(0)
dict = model.state_dict()
fo = open(exportPath, "wb")
# 0. version id
fo.write(struct.pack('i', 2))
# 0.1 model info
#if model.config.model_type == "chatglm" and model.config.transformers_version == "4.30.2":
# model.config.model_type = "chatglm3"
modelInfo = model.config.__dict__
if model.generation_config is not None:
modelInfo.update(model.generation_config.__dict__)
if ("model_type" not in modelInfo):
print("unknown model_type.")
exit(0)
if (pre_prompt):
modelInfo["pre_prompt"] = pre_prompt
if (user_role):
modelInfo["user_role"] = user_role
if (bot_role):
modelInfo["bot_role"] = bot_role
if (history_sep):
modelInfo["history_sep"] = history_sep
if (modelInfo["model_type"] == "baichuan" and hasattr(model, "model") and hasattr(model.model, "get_alibi_mask")):
        # Baichuan 2
modelInfo["use_alibi"] = "1"
modelInfo["pre_prompt"] = ""
modelInfo["user_role"] = ("<FLM_FIX_TOKEN_" + str(model.generation_config.user_token_id) + ">") if hasattr(model.generation_config, "user_token_id") else "";
modelInfo["bot_role"] = ("<FLM_FIX_TOKEN_" + str(model.generation_config.assistant_token_id) + ">") if hasattr(model.generation_config, "assistant_token_id") else "";
modelInfo["history_sep"] = ""
if (modelInfo["model_type"] == "baichuan" and modelInfo["vocab_size"] == 125696):
        # Baichuan 2, 7B
modelInfo["pre_prompt"] = ""
modelInfo["user_role"] = ("<FLM_FIX_TOKEN_" + str(model.generation_config.user_token_id) + ">") if hasattr(model.generation_config, "user_token_id") else "";
modelInfo["bot_role"] = ("<FLM_FIX_TOKEN_" + str(model.generation_config.assistant_token_id) + ">") if hasattr(model.generation_config, "assistant_token_id") else "";
modelInfo["history_sep"] = ""
if modelInfo["model_type"] == "qwen":
if modelInfo["chat_format"] == "chatml":
modelInfo["im_end_id"] = tokenizer.im_end_id
modelInfo["im_start_id"] = tokenizer.im_start_id
if (modelInfo["model_type"] == "chatglm" and hasattr(tokenizer, "build_chat_input")):
print("chatglm3")
# chatglm3
modelInfo["pre_prompt"] = "";
modelInfo["user_role"] = ("<FLM_FIX_TOKEN_" + str(tokenizer.get_command("<|user|>")) + ">\n");
modelInfo["bot_role"] = ("<FLM_FIX_TOKEN_" + str(tokenizer.get_command("<|assistant|>")) + ">");
modelInfo["history_sep"] = "";
modelInfo["tokenizer_use_score"] = "1" # 分词带分数
if hasattr(model, "peft_config"):
adapter_size = len(model.peft_config)
modelInfo["peft_size"] = adapter_size
fo.write(struct.pack('i', len(modelInfo)))
for it in modelInfo.keys():
writeKeyValue(fo, str(it), str(modelInfo[it]))
if hasattr(model, "peft_config"):
for adapter_name in model.peft_config.keys():
adapter_dict = model.peft_config[adapter_name].__dict__
writeString(fo, adapter_name)
fo.write(struct.pack('i', len(adapter_dict)))
for it in adapter_dict.keys():
writeKeyValue(fo, str(it), str(adapter_dict[it]))
# 1. vocab
if (tokenizer):
if (hasattr(tokenizer, "tokenizer")):
if (modelInfo['model_type'] == "qwen"):
pass
else:
tokenizer = tokenizer.tokenizer
if (hasattr(tokenizer, "sp_model")):
piece_size = tokenizer.sp_model.piece_size()
fo.write(struct.pack('i', piece_size))
for i in range(piece_size):
s = tokenizer.sp_model.id_to_piece(i).encode()
fo.write(struct.pack('i', len(s)))
for c in s:
fo.write(struct.pack('i', c))
fo.write(struct.pack('i', i))
fo.write(struct.pack('f', float(tokenizer.sp_model.get_score(i))))
else:
vocab = tokenizer.get_vocab()
fo.write(struct.pack('i', len(vocab)))
for v in vocab.keys():
if (modelInfo['model_type'] == "qwen"):
s = v
elif (modelInfo["model_type"] == "moss"):
s = [(ord(c) if c not in tokenizer.byte_decoder else tokenizer.byte_decoder[c]) for c in v]
else:
s = v.encode()
fo.write(struct.pack('i', len(s)))
for c in s:
fo.write(struct.pack('i', c))
fo.write(struct.pack('i', vocab[v]))
fo.write(struct.pack('f', 1.0))
else:
fo.write(struct.pack('i', 0))
weight_type_dict = {}
module_dict = {}
for key, m in model.named_modules():
if (isinstance(m, torch.nn.Linear)):
weight_type_dict[key + ".weight"] = "linear"
module_dict[key + ".weight"] = m
if (isinstance(m, torch.nn.Embedding)):
weight_type_dict[key] = "embedding"
# 2. weight
fo.write(struct.pack('i', len(dict)))
tot = 0
for key in dict:
ori_data_type = 0
ori_np_data_type = np.float32
cur_weight_type = 0
if (key in weight_type_dict and weight_type_dict[key] in fastllm_weight_type_dict):
cur_weight_type = fastllm_weight_type_dict[weight_type_dict[key]]
to_data_type = 0
if (cur_weight_type == 1):
to_data_type = fastllm_data_type_dict[dtype]
if (to_data_type == 7):
ori_data_type = 7
ori_np_data_type = np.float16
cur = dict[key].numpy().astype(ori_np_data_type)
if hasattr(model, "peft_config"):
weight_name = key.replace('base_model.model.', '')
fo.write(struct.pack('i', len(weight_name)))
fo.write(weight_name.encode())
else:
fo.write(struct.pack('i', len(key)))
fo.write(key.encode())
fo.write(struct.pack('i', len(cur.shape)))
for i in cur.shape:
fo.write(struct.pack('i', i))
if (to_data_type == 3):
write_int8(fo, cur)
elif (to_data_type == 8):
write_int4(fo, cur)
else:
fo.write(struct.pack('i', to_data_type))
fo.write(cur.data)
tot += 1
print("output (", tot, "/", len(dict), end = " )\r")
print("\nfinish.")
fo.close()
from setuptools import setup, find_packages
setup (
name = "fastllm_pytools",
version = "0.0.1",
description = "Fastllm pytools",
packages = ['fastllm_pytools'],
url = "https://developer.hpccube.com/codes/aicomponent/fastllm",
package_data = {
'': ['*.dll', '*.so']
}
)
import streamlit as st
from streamlit_chat import message
from fastllm_pytools import llm
import sys
st.set_page_config(
page_title="fastllm web demo",
page_icon=":robot:"
)
@st.cache_resource
def get_model():
model = llm.model(sys.argv[1])
return model
if "messages" not in st.session_state:
st.session_state.messages = []
for i, (prompt, response) in enumerate(st.session_state.messages):
with st.chat_message("user"):
st.markdown(prompt)
with st.chat_message("assistant"):
st.markdown(response)
if prompt := st.chat_input("Start chatting"):
model = get_model()
with st.chat_message("user"):
st.markdown(prompt)
with st.chat_message("assistant"):
message_placeholder = st.empty()
full_response = ""
for chunk in model.stream_response(prompt, st.session_state.messages, one_by_one = True):
full_response += chunk
message_placeholder.markdown(full_response + "▌")
message_placeholder.markdown(full_response)
st.session_state.messages.append((prompt, full_response))