Commit 467ec853 authored by lvzhen

Merge branch 'master' into 'master'

ChatGLM3-6B 微调训练

See merge request !2
parents 971c0aee 0006ad16
name: "\U0001F41B Bug Report"
description: Submit a bug report to help us improve ChatGLM3 / 提交一个 Bug 问题报告来帮助我们改进 ChatGLM3
body:
- type: textarea
id: system-info
attributes:
label: System Info / 系統信息
description: Your operating environment / 您的运行环境信息
placeholder: Includes Cuda version, Transformers version, Python version, operating system, hardware information (if you suspect a hardware problem)... / 包括Cuda版本,Transformers版本,Python版本,操作系统,硬件信息(如果您怀疑是硬件方面的问题)...
validations:
required: true
- type: textarea
id: who-can-help
attributes:
label: Who can help? / 谁可以帮助到您?
description: |
Your issue will be replied to more quickly if you can figure out the right person to tag with @
All issues are read by one of the maintainers, so if you don't know who to tag, just leave this blank and our maintainer will ping the right person.
Please tag fewer than 3 people.
如果您能找到合适的标签 @,您的问题会更快得到回复。
所有问题都会由我们的维护者阅读,如果您不知道该标记谁,只需留空,我们的维护人员会找到合适的开发组成员来解决问题。
标记的人数应该不超过 3 个人。
Related demo leader / 相关demo负责人 :
- finetune_demo: @Btlmd
- langchain_demo: @yincf
- composite_demo: @abmfy
If the bug is not in one of these three subsections, you may leave the helper unspecified; our maintainer will find the right person in the development group to solve the problem.
如果不是这三个子版块的bug,您可以不指明帮助者,我们的维护人员会找到合适的开发组成员来解决问题。
placeholder: "@Username ..."
- type: checkboxes
id: information-scripts-examples
attributes:
label: Information / 问题信息
description: 'The problem arises when using: / 问题出现在'
options:
- label: "The official example scripts / 官方的示例脚本"
- label: "My own modified scripts / 我自己修改的脚本和任务"
- type: textarea
id: reproduction
validations:
required: true
attributes:
label: Reproduction / 复现过程
description: |
Please provide a code example that reproduces the problem you encountered, preferably with a minimal reproduction unit.
If you have code snippets, error messages, stack traces, please provide them here as well.
Please format your code correctly using code tags. See https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
Do not use screenshots, as they are difficult to read and (more importantly) do not allow others to copy and paste your code.
请提供能重现您遇到的问题的代码示例,最好是最小复现单元。
如果您有代码片段、错误信息、堆栈跟踪,也请在此提供。
请使用代码标签正确格式化您的代码。请参见 https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
请勿使用截图,因为截图难以阅读,而且(更重要的是)不允许他人复制粘贴您的代码。
placeholder: |
Steps to reproduce the behavior/复现Bug的步骤:
1.
2.
3.
- type: textarea
id: expected-behavior
validations:
required: true
attributes:
label: Expected behavior / 期待表现
description: "A clear and concise description of what you would expect to happen. /简单描述您期望发生的事情。"
\ No newline at end of file
name: "\U0001F680 Feature request"
description: Submit a request for a new ChatGLM3 feature / 提交一个新的 ChatGLM3 的功能建议
labels: [ "feature" ]
body:
- type: textarea
id: feature-request
validations:
required: true
attributes:
label: Feature request / 功能建议
description: |
A brief description of the functional proposal. Links to corresponding papers and code are desirable.
对功能建议的简述。最好提供对应的论文和代码链接
- type: textarea
id: motivation
validations:
required: true
attributes:
label: Motivation / 动机
description: |
Your motivation for making the suggestion. If that motivation is related to another GitHub issue, link to it here.
您提出建议的动机。如果该动机与另一个 GitHub 问题有关,请在此处提供对应的链接。
- type: textarea
id: contribution
validations:
required: true
attributes:
label: Your contribution / 您的贡献
description: |
Your PR link or any other link you can help with.
您的PR链接或者其他您能提供帮助的链接。
\ No newline at end of file
# Raise valuable PR / 提出有价值的PR
## Caution/ 注意事项:
Users should keep the following points in mind when submitting PRs:
1. The proposed PR should be about this project.
2. The proposed PR should be focused; if there are multiple ideas and optimizations, they should be split into different PRs.
用户在提交PR时候应该注意以下几点:
1. 提出的PR应该是关于本项目的。
2. 提出的PR应该具有针对性,如果具有多个不同的想法和优化方案,应该分配到不同的PR中。
## 不应该提出的PR / PRs that should not be proposed
If a developer proposes a PR that falls into any of the following categories, it may be closed or rejected:
1. PRs that do not describe the proposed improvement.
2. PRs that combine multiple issues of different types.
3. PRs that largely duplicate already existing PRs.
如果开发者提出关于以下方面的PR,则可能会被直接关闭或拒绝通过。
1. 没有说明改进方案的。
2. 多个不同类型的问题合并在一个PR中的。
3. 提出的PR与已经存在的PR高度重复的。
# 检查您的PR
- [ ] Have you read the Contributor Guidelines, Pull Request section? / 您是否阅读了贡献者指南、Pull Request 部分?
- [ ] Has this been discussed/approved via a Github issue or forum? If so, add a link. / 是否通过 Github 问题或论坛讨论/批准过?如果是,请添加链接。
- [ ] Did you make sure you updated the documentation with your changes? Here are the Documentation Guidelines, and here are the Documentation Formatting Tips. /您是否确保根据您的更改更新了文档?这里是文档指南,这里是文档格式化技巧。
- [ ] Did you write new required tests? / 您是否编写了新的必要测试?
- [ ] Is your PR limited to a single issue? / 您的PR是否仅针对一个问题?
\ No newline at end of file
__pycache__
# finetune_demo: generated & downloaded files
finetune_demo/output
finetune_demo/data
finetune_demo/formatted_data
ToolAlpaca/
AdvertiseGen/
*.gz
*.idea
.DS_Store
\ No newline at end of file
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.13.1-centos7.6-dtk-23.04-py38-latest
COPY requirements.txt requirements.txt
RUN source /opt/dtk-23.04/env.sh
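# Note: environment changes made by `source` in a RUN layer do not persist into later layers
# or into the running container; /opt/dtk-23.04/env.sh may need to be sourced again at runtime.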
RUN cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && echo 'Asia/Shanghai' >/etc/timezone
ENV LANG C.UTF-8
RUN pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
# Intel Device Demo
This folder helps developers accelerate and deploy the ChatGLM3-6B model on Intel devices.
## 1. Hardware requirements
Devices supported by the demos in this folder include:
- Intel CPUs, including consumer CPUs and server / workstation CPUs
- Intel Arc discrete GPUs, such as the Arc A770
- Intel integrated GPUs (CPU graphics)
- Other Intel toolkits that in principle support OpenVINO acceleration
## 2. Directory layout
- IPEX_llm_xxx_demo: IPEX-LLM is a low-precision, lightweight large-language-model library built for Intel XPUs (Xeon/Core/Flex/Arc/PVC). It offers broad model support, low latency, and a small memory footprint on Intel platforms; this demo shows accelerated model deployment with it.
- OpenVINO_demo: Accelerated model deployment using the Intel OpenVINO inference framework.
- Pytorch_demo (not yet released): Development in a PyTorch environment using Intel Extension for PyTorch (for Intel Arc GPUs).
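As a quick sanity check of the IPEX-LLM path, a minimal sketch along the lines of the generate example later in this folder (the model path is a placeholder; point it at your local chatglm3-6b checkpoint) might look like this:

```python
# Minimal IPEX-LLM INT4 inference sketch; model_path is a placeholder.
from ipex_llm.transformers import AutoModel
from transformers import AutoTokenizer

model_path = "THUDM/chatglm3-6b"
model = AutoModel.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# ChatGLM3 prompt format with role special tokens
prompt = "<|user|>\nWho are you?\n<|assistant|>"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```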
"""
This script implements an API for the ChatGLM3-6B model,
formatted similarly to OpenAI's API (https://platform.openai.com/docs/api-reference/chat).
It's designed to be run as a web server using FastAPI and uvicorn,
making the ChatGLM3-6B model accessible through the OpenAI Client.
Key Components and Features:
- Model and Tokenizer Setup: Configures the model and tokenizer paths and loads them.
- FastAPI Configuration: Sets up a FastAPI application with CORS middleware for handling cross-origin requests.
- API Endpoints:
- "/v1/models": Lists the available models, specifically ChatGLM3-6B.
- "/v1/chat/completions": Processes chat completion requests with options for streaming and regular responses.
- "/v1/embeddings": Processes embedding requests for a list of text inputs.
- Token Limit Caution: In the OpenAI API, 'max_tokens' is equivalent to HuggingFace's 'max_new_tokens', not 'max_length'.
For instance, setting 'max_tokens' to 8192 for a 6B model results in an error, because the model cannot output
that many tokens after the history and prompt tokens are deducted.
- Stream Handling and Custom Functions: Manages streaming responses and custom function calls within chat responses.
- Pydantic Models: Defines structured models for requests and responses, enhancing API documentation and type safety.
- Main Execution: Initializes the model and tokenizer, and starts the FastAPI app on the designated host and port.
Note:
This script doesn't include the setup for special tokens or multi-GPU support by default.
Users need to configure their special tokens and can enable multi-GPU support as per the provided instructions.
Embedding models are only supported on a single GPU.
"""
import os
import time
import tiktoken
import torch
import uvicorn
from fastapi import FastAPI, HTTPException, Response
from fastapi.middleware.cors import CORSMiddleware
from contextlib import asynccontextmanager
from typing import List, Literal, Optional, Union
from loguru import logger
from pydantic import BaseModel, Field
from ipex_llm.transformers import AutoModel
from transformers import AutoTokenizer
from utils import process_response, generate_chatglm3, generate_stream_chatglm3
# from sentence_transformers import SentenceTransformer
from sse_starlette.sse import EventSourceResponse
# Set up limit request time
EventSourceResponse.DEFAULT_PING_INTERVAL = 1000
# set LLM path
MODEL_PATH = os.environ.get('MODEL_PATH', 'THUDM/chatglm3-6b')
TOKENIZER_PATH = os.environ.get("TOKENIZER_PATH", MODEL_PATH)
# set Embedding Model path
EMBEDDING_PATH = os.environ.get('EMBEDDING_PATH', 'BAAI/bge-large-zh-v1.5')
@asynccontextmanager
async def lifespan(app: FastAPI):
yield
if torch.cuda.is_available():
torch.cuda.empty_cache()
...@@ -79,6 +108,33 @@ class DeltaMessage(BaseModel):
function_call: Optional[FunctionCallResponse] = None
## for Embedding
class EmbeddingRequest(BaseModel):
input: List[str]
model: str
class CompletionUsage(BaseModel):
prompt_tokens: int
completion_tokens: int
total_tokens: int
class EmbeddingResponse(BaseModel):
data: list
model: str
object: str
usage: CompletionUsage
# for ChatCompletionRequest
class UsageInfo(BaseModel):
prompt_tokens: int = 0
total_tokens: int = 0
completion_tokens: Optional[int] = 0
class ChatCompletionRequest(BaseModel):
model: str
messages: List[ChatMessage]
...@@ -86,8 +142,7 @@ class ChatCompletionRequest(BaseModel):
top_p: Optional[float] = 0.8
max_tokens: Optional[int] = None
stream: Optional[bool] = False
tools: Optional[Union[dict, List[dict]]] = None
repetition_penalty: Optional[float] = 1.1
...@@ -98,29 +153,68 @@ class ChatCompletionResponseChoice(BaseModel):
class ChatCompletionResponseStreamChoice(BaseModel):
index: int
delta: DeltaMessage
finish_reason: Optional[Literal["stop", "length", "function_call"]]
class ChatCompletionResponse(BaseModel):
model: str
id: str
object: Literal["chat.completion", "chat.completion.chunk"]
choices: List[Union[ChatCompletionResponseChoice, ChatCompletionResponseStreamChoice]]
created: Optional[int] = Field(default_factory=lambda: int(time.time()))
usage: Optional[UsageInfo] = None
@app.get("/health")
async def health() -> Response:
"""Health check."""
return Response(status_code=200)
@app.post("/v1/embeddings", response_model=EmbeddingResponse)
async def get_embeddings(request: EmbeddingRequest):
embeddings = [embedding_model.encode(text) for text in request.input]
embeddings = [embedding.tolist() for embedding in embeddings]
def num_tokens_from_string(string: str) -> int:
"""
Returns the number of tokens in a text string.
use cl100k_base tokenizer
"""
encoding = tiktoken.get_encoding('cl100k_base')
num_tokens = len(encoding.encode(string))
return num_tokens
response = {
"data": [
{
"object": "embedding",
"embedding": embedding,
"index": index
}
for index, embedding in enumerate(embeddings)
],
"model": request.model,
"object": "list",
"usage": CompletionUsage(
prompt_tokens=sum(len(text.split()) for text in request.input),
completion_tokens=0,
total_tokens=sum(num_tokens_from_string(text) for text in request.input),
)
}
return response
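# Illustrative client-side call for the embeddings endpoint above (not part of this file),
# assuming the server runs locally on port 8000 and the `openai` client package is installed:
#
#   from openai import OpenAI
#   client = OpenAI(api_key="EMPTY", base_url="http://127.0.0.1:8000/v1/")
#   result = client.embeddings.create(model="chatglm3-6b", input=["hello", "你好"])
#   print(len(result.data), len(result.data[0].embedding))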
@app.get("/v1/models", response_model=ModelList) @app.get("/v1/models", response_model=ModelList)
async def list_models(): async def list_models():
model_card = ModelCard(id="chatglm3-6b") model_card = ModelCard(
return ModelList(data=[model_card]) id="chatglm3-6b"
)
return ModelList(
data=[model_card]
)
@app.post("/v1/chat/completions", response_model=ChatCompletionResponse) @app.post("/v1/chat/completions", response_model=ChatCompletionResponse)
...@@ -138,24 +232,74 @@ async def create_chat_completion(request: ChatCompletionRequest): ...@@ -138,24 +232,74 @@ async def create_chat_completion(request: ChatCompletionRequest):
echo=False, echo=False,
stream=request.stream, stream=request.stream,
repetition_penalty=request.repetition_penalty, repetition_penalty=request.repetition_penalty,
functions=request.functions, tools=request.tools,
) )
logger.debug(f"==== request ====\n{gen_params}") logger.debug(f"==== request ====\n{gen_params}")
if request.stream: if request.stream:
generate = predict(request.model, gen_params)
return EventSourceResponse(generate, media_type="text/event-stream")
# Use the stream mode to read the first few characters, if it is not a function call, direct stram output
predict_stream_generator = predict_stream(request.model, gen_params)
output = next(predict_stream_generator)
if not contains_custom_function(output):
return EventSourceResponse(predict_stream_generator, media_type="text/event-stream")
# Obtain the full result at once and determine whether tools need to be called.
logger.debug(f"First result output:\n{output}")
function_call = None
if output and request.tools:
try:
function_call = process_response(output, use_tool=True)
except:
logger.warning("Failed to parse tool call")
# CallFunction
if isinstance(function_call, dict):
function_call = FunctionCallResponse(**function_call)
"""
In this demo, we did not register any tools.
You can use the tools that have been implemented in our `tools_using_demo` and implement your own streaming tool implementation here.
Similar to the following method:
function_args = json.loads(function_call.arguments)
tool_response = dispatch_tool(tool_name: str, tool_params: dict)
"""
tool_response = ""
if not gen_params.get("messages"):
gen_params["messages"] = []
gen_params["messages"].append(ChatMessage(
role="assistant",
content=output,
))
gen_params["messages"].append(ChatMessage(
role="function",
name=function_call.name,
content=tool_response,
))
# Streaming output of results after function calls
generate = predict(request.model, gen_params)
return EventSourceResponse(generate, media_type="text/event-stream")
else:
# Handled to avoid exceptions in the above parsing function process.
generate = parse_output_text(request.model, output)
return EventSourceResponse(generate, media_type="text/event-stream")
# Here is the handling of stream = False
response = generate_chatglm3(model, tokenizer, gen_params)
# Remove the first newline character
if response["text"].startswith("\n"):
response["text"] = response["text"][1:]
response["text"] = response["text"].strip()
usage = UsageInfo()
function_call, finish_reason = None, "stop"
if request.tools:
try:
function_call = process_response(response["text"], use_tool=True)
except:
...@@ -181,7 +325,14 @@ async def create_chat_completion(request: ChatCompletionRequest):
task_usage = UsageInfo.model_validate(response["usage"])
for usage_key, usage_value in task_usage.model_dump().items():
setattr(usage, usage_key, getattr(usage, usage_key) + usage_value)
return ChatCompletionResponse(
model=request.model,
id="", # for open_source model, id is empty
choices=[choice_data],
object="chat.completion",
usage=usage
)
async def predict(model_id: str, params: dict):
...@@ -192,7 +343,7 @@ async def predict(model_id: str, params: dict):
delta=DeltaMessage(role="assistant"),
finish_reason=None
)
chunk = ChatCompletionResponse(model=model_id, id="", choices=[choice_data], object="chat.completion.chunk")
yield "{}".format(chunk.model_dump_json(exclude_unset=True))
previous_text = ""
...@@ -210,7 +361,8 @@ async def predict(model_id: str, params: dict):
try:
function_call = process_response(decoded_unicode, use_tool=True)
except:
logger.warning(
"Failed to parse tool call, maybe the response is not a tool call or have been answered.")
if isinstance(function_call, dict):
function_call = FunctionCallResponse(**function_call)
...@@ -226,7 +378,12 @@ async def predict(model_id: str, params: dict):
delta=delta,
finish_reason=finish_reason
)
chunk = ChatCompletionResponse(
model=model_id,
id="",
choices=[choice_data],
object="chat.completion.chunk"
)
yield "{}".format(chunk.model_dump_json(exclude_unset=True)) yield "{}".format(chunk.model_dump_json(exclude_unset=True))
choice_data = ChatCompletionResponseStreamChoice( choice_data = ChatCompletionResponseStreamChoice(
...@@ -234,16 +391,141 @@ async def predict(model_id: str, params: dict): ...@@ -234,16 +391,141 @@ async def predict(model_id: str, params: dict):
delta=DeltaMessage(), delta=DeltaMessage(),
finish_reason="stop" finish_reason="stop"
) )
chunk = ChatCompletionResponse(model=model_id, choices=[choice_data], object="chat.completion.chunk") chunk = ChatCompletionResponse(
model=model_id,
id="",
choices=[choice_data],
object="chat.completion.chunk"
)
yield "{}".format(chunk.model_dump_json(exclude_unset=True)) yield "{}".format(chunk.model_dump_json(exclude_unset=True))
yield '[DONE]' yield '[DONE]'
if __name__ == "__main__": def predict_stream(model_id, gen_params):
"""
The function call is compatible with stream-mode output.
The first seven characters of the output are inspected to decide whether it is a function call.
If it is not a function call, the output is streamed directly.
Otherwise, the complete text of the function call is returned.
:param model_id:
:param gen_params:
:return:
"""
output = ""
is_function_call = False
has_send_first_chunk = False
for new_response in generate_stream_chatglm3(model, tokenizer, gen_params):
decoded_unicode = new_response["text"]
delta_text = decoded_unicode[len(output):]
output = decoded_unicode
# When it is not yet identified as a function call and the output length is > 7,
# try to judge whether it is a function call according to the special function prefix
if not is_function_call and len(output) > 7:
# Determine whether a function is called
is_function_call = contains_custom_function(output)
if is_function_call:
continue
# Non-function call, direct stream output
finish_reason = new_response["finish_reason"]
# Send an empty string first to avoid truncation by subsequent next() operations.
if not has_send_first_chunk:
message = DeltaMessage(
content="",
role="assistant",
function_call=None,
)
choice_data = ChatCompletionResponseStreamChoice(
index=0,
delta=message,
finish_reason=finish_reason
)
chunk = ChatCompletionResponse(
model=model_id,
id="",
choices=[choice_data],
created=int(time.time()),
object="chat.completion.chunk"
)
yield "{}".format(chunk.model_dump_json(exclude_unset=True))
send_msg = delta_text if has_send_first_chunk else output
has_send_first_chunk = True
message = DeltaMessage(
content=send_msg,
role="assistant",
function_call=None,
)
choice_data = ChatCompletionResponseStreamChoice(
index=0,
delta=message,
finish_reason=finish_reason
)
chunk = ChatCompletionResponse(
model=model_id,
id="",
choices=[choice_data],
created=int(time.time()),
object="chat.completion.chunk"
)
yield "{}".format(chunk.model_dump_json(exclude_unset=True))
if is_function_call:
yield output
else:
yield '[DONE]'
async def parse_output_text(model_id: str, value: str):
"""
Directly output the text content of value
:param model_id:
:param value:
:return:
"""
choice_data = ChatCompletionResponseStreamChoice(
index=0,
delta=DeltaMessage(role="assistant", content=value),
finish_reason=None
)
chunk = ChatCompletionResponse(model=model_id, id="", choices=[choice_data], object="chat.completion.chunk")
yield "{}".format(chunk.model_dump_json(exclude_unset=True))
choice_data = ChatCompletionResponseStreamChoice(
index=0,
delta=DeltaMessage(),
finish_reason="stop"
)
chunk = ChatCompletionResponse(model=model_id, id="", choices=[choice_data], object="chat.completion.chunk")
yield "{}".format(chunk.model_dump_json(exclude_unset=True))
yield '[DONE]'
def contains_custom_function(value: str) -> bool:
"""
Determine whether the output is a 'function_call' according to a special function prefix.
For example, the functions defined in "tools_using_demo/tool_register.py" are all "get_xxx" and start with "get_"
[Note] This is not a rigorous judgment method, only for reference.
:param value:
:return:
"""
return value and 'get_' in value
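# Illustrative (not part of the original file): contains_custom_function("get_current_weather\n{...}")
# returns True, while contains_custom_function("Hello!") returns False.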
if __name__ == "__main__":
# Load LLM
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_PATH,
load_in_4bit=True,
trust_remote_code=True)
# load Embedding
# embedding_model = SentenceTransformer(EMBEDDING_PATH, device="cuda")
uvicorn.run(app, host='0.0.0.0', port=8000, workers=1)
import time
from ipex_llm.transformers import AutoModel
from transformers import AutoTokenizer
CHATGLM_V3_PROMPT_FORMAT = "<|user|>\n{prompt}\n<|assistant|>"
# Please specify the local path to the chatglm3-6b model
model_path = "D:/AI/ChatGLM3/model/chatglm3-6b/"
# Load the ChatGLM3-6B model and quantize it to INT4
model = AutoModel.from_pretrained(model_path,
load_in_4bit=True,
trust_remote_code=True)
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path,
trust_remote_code=True)
# Prepare ChatGLM3 format prompt
prompt = CHATGLM_V3_PROMPT_FORMAT.format(prompt="Who are you?")
# Encode the prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt")
st = time.time()
# Perform inference calculation and generate Tokens
output = model.generate(input_ids, max_new_tokens=32)
end = time.time()
# Decode the generated Tokens and display them
output_str = tokenizer.decode(output[0], skip_special_tokens=True)
print(f'Inference time: {end-st} s')
print('-'*20, 'Prompt', '-'*20)
print(prompt)
print('-'*20, 'Output', '-'*20)
print(output_str)
"""
This script creates an interactive web demo for the ChatGLM3-6B model using Gradio,
a Python library for building quick and easy UI components for machine learning models.
It's designed to showcase the capabilities of the ChatGLM3-6B model in a user-friendly interface,
allowing users to interact with the model through a chat-like interface.
Usage:
- Run the script to start the Gradio web server.
- Interact with the model by typing questions and receiving responses.
Requirements:
- Gradio (required for 4.13.0 and later, 3.x is not support now) should be installed.
Note: The script includes a modification to the Chatbot's postprocess method to handle markdown to HTML conversion,
ensuring that the chat interface displays formatted text correctly.
"""
import os
import streamlit as st
from ipex_llm.transformers import AutoModel
from transformers import AutoTokenizer
st.set_page_config(
page_title="ChatGLM3-6B+BigDL-LLM demo",
page_icon=":robot:",
layout="wide"
)
MODEL_PATH = os.environ.get('MODEL_PATH', 'THUDM/chatglm3-6b')
@st.cache_resource
def get_model():
model = AutoModel.from_pretrained(MODEL_PATH,
load_in_4bit=True,
trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH,
trust_remote_code=True)
return tokenizer, model
tokenizer, model = get_model()
if "history" not in st.session_state:
st.session_state.history = []
if "past_key_values" not in st.session_state:
st.session_state.past_key_values = None
max_length = st.sidebar.slider("max_length", 0, 32768, 8192, step=1)
top_p = st.sidebar.slider("top_p", 0.0, 1.0, 0.8, step=0.01)
temperature = st.sidebar.slider("temperature", 0.0, 1.0, 0.6, step=0.01)
buttonClean = st.sidebar.button("Clear session history", key="clean")
if buttonClean:
st.session_state.history = []
st.session_state.past_key_values = None
st.rerun()
for i, message in enumerate(st.session_state.history):
if message["role"] == "user":
with st.chat_message(name="user", avatar="user"):
st.markdown(message["content"])
else:
with st.chat_message(name="assistant", avatar="assistant"):
st.markdown(message["content"])
with st.chat_message(name="user", avatar="user"):
input_placeholder = st.empty()
with st.chat_message(name="assistant", avatar="assistant"):
message_placeholder = st.empty()
prompt_text = st.chat_input("Please enter your question.")
if prompt_text:
input_placeholder.markdown(prompt_text)
history = st.session_state.history
past_key_values = st.session_state.past_key_values
for response, history, past_key_values in model.stream_chat(
tokenizer,
prompt_text,
history,
past_key_values=past_key_values,
max_length=max_length,
top_p=top_p,
temperature=temperature,
return_past_key_values=True,
):
message_placeholder.markdown(response)
st.session_state.history = history
st.session_state.past_key_values = past_key_values
\ No newline at end of file
import torch
import time
import argparse
import numpy as np
from ipex_llm.transformers import AutoModel
from modelscope import AutoTokenizer
from transformers import AutoTokenizer
# you could tune the prompt based on your own model,
# here the prompt tuning refers to https://github.com/THUDM/ChatGLM3/blob/main/PROMPT.md
CHATGLM_V3_PROMPT_FORMAT = "<|user|>\n{prompt}\n<|assistant|>"
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for ModelScope ChatGLM3 model')
parser.add_argument('--repo-id-or-model-path', type=str, default="ZhipuAI/chatglm3-6b",
help='The ModelScope repo id for the ChatGLM3 model to be downloaded'
', or the path to the ModelScope checkpoint folder')
parser.add_argument('--prompt', type=str, default="AI是什么?",
help='Prompt to infer')
parser.add_argument('--n-predict', type=int, default=32,
help='Max tokens to predict')
args = parser.parse_args()
model_path = args.repo_id_or_model_path
# Load model in 4 bit,
# which convert the relevant layers in the model into INT4 format
# It is important to set `model_hub='modelscope'`, otherwise model hub is default to be huggingface
model = AutoModel.from_pretrained(model_path,
load_in_4bit=True,
trust_remote_code=True,
model_hub='modelscope')
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path,
trust_remote_code=True)
# Generate predicted tokens
with torch.inference_mode():
prompt = CHATGLM_V3_PROMPT_FORMAT.format(prompt=args.prompt)
input_ids = tokenizer.encode(prompt, return_tensors="pt")
st = time.time()
# if your selected model is capable of utilizing previous key/value attentions
# to enhance decoding speed, but has `"use_cache": false` in its model config,
# it is important to set `use_cache=True` explicitly in the `generate` function
# to obtain optimal performance with IPEX-LLM INT4 optimizations
output = model.generate(input_ids,
max_new_tokens=args.n_predict)
end = time.time()
output_str = tokenizer.decode(output[0], skip_special_tokens=True)
print(f'Inference time: {end - st} s')
print('-' * 20, 'Prompt', '-' * 20)
print(prompt)
print('-' * 20, 'Output', '-' * 20)
print(output_str)
\ No newline at end of file
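# Example invocation of the ModelScope script above (assuming it is saved as generate.py):
#   python generate.py --repo-id-or-model-path ZhipuAI/chatglm3-6b --prompt "AI是什么?" --n-predict 64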
"""
This script is an example of using the OpenAI API to create various interactions with a ChatGLM3 model.
It includes functions to:
1. Conduct a basic chat session, asking about weather conditions in multiple cities.
2. Initiate a simple chat in Chinese, asking the model to tell a short story.
3. Retrieve and print embeddings for a given text input.
Each function demonstrates a different aspect of the API's capabilities, showcasing how to make requests
and handle responses.
"""
from openai import OpenAI
import time
base_url = "http://127.0.0.1:8000/v1/"
client = OpenAI(api_key="EMPTY", base_url=base_url)
def function_chat():
messages = [{"role": "user", "content": "What's the weather like in San Francisco, Tokyo, and Paris?"}]
tools = [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather in a given location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA",
},
"unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
},
"required": ["location"],
},
},
}
]
response = client.chat.completions.create(
model="chatglm3-6b",
messages=messages,
tools=tools,
tool_choice="auto",
)
if response:
content = response.choices[0].message.content
print(content)
else:
print("Error:", response.status_code)
def simple_chat(use_stream=True):
messages = [
{
"role": "system",
"content": "You are ChatGLM3, a large language model trained by Zhipu.AI. Follow the user's "
"instructions carefully. Respond using markdown.",
},
{
"role": "user",
"content": "你好,请你用生动的话语给我讲一个小故事吧"
}
]
response = client.chat.completions.create(
model="chatglm3-6b",
messages=messages,
stream=use_stream,
max_tokens=256,
temperature=0.8,
presence_penalty=1.1,
top_p=0.8)
if response:
if use_stream:
for chunk in response:
print(chunk.choices[0].delta.content)
else:
content = response.choices[0].message.content
print(content)
else:
print("Error:", response.status_code)
if __name__ == "__main__":
simple_chat(use_stream=False)
simple_chat(use_stream=True)
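# function_chat() above builds a tool-call request but is not invoked by default;
# if your server is started with tool support, you can additionally call:
#   function_chat()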
import gc
import json
import torch
from transformers import PreTrainedModel, PreTrainedTokenizer
from transformers.generation.logits_process import LogitsProcessor
from typing import Union, Tuple
class InvalidScoreLogitsProcessor(LogitsProcessor):
def __call__(
self, input_ids: torch.LongTensor, scores: torch.FloatTensor
) -> torch.FloatTensor:
if torch.isnan(scores).any() or torch.isinf(scores).any():
scores.zero_()
scores[..., 5] = 5e4
return scores
def process_response(output: str, use_tool: bool = False) -> Union[str, dict]:
content = ""
for response in output.split("<|assistant|>"):
metadata, content = response.split("\n", maxsplit=1)
if not metadata.strip():
content = content.strip()
content = content.replace("[[训练时间]]", "2023年")
else:
if use_tool:
content = "\n".join(content.split("\n")[1:-1])
def tool_call(**kwargs):
return kwargs
parameters = eval(content)
content = {
"name": metadata.strip(),
"arguments": json.dumps(parameters, ensure_ascii=False)
}
else:
content = {
"name": metadata.strip(),
"content": content
}
return content
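# Illustrative example (not part of the original file): for a model output segment such as
#   "get_current_weather\n```python\ntool_call(location='Beijing')\n```"
# process_response(..., use_tool=True) strips the first and last lines of the content,
# evaluates tool_call(...), and returns
#   {"name": "get_current_weather", "arguments": "{\"location\": \"Beijing\"}"}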
@torch.inference_mode()
def generate_stream_chatglm3(model: PreTrainedModel, tokenizer: PreTrainedTokenizer, params: dict):
messages = params["messages"]
tools = params["tools"]
temperature = float(params.get("temperature", 1.0))
repetition_penalty = float(params.get("repetition_penalty", 1.0))
top_p = float(params.get("top_p", 1.0))
max_new_tokens = int(params.get("max_tokens", 256))
echo = params.get("echo", True)
messages = process_chatglm_messages(messages, tools=tools)
query, role = messages[-1]["content"], messages[-1]["role"]
inputs = tokenizer.build_chat_input(query, history=messages[:-1], role=role)
inputs = inputs.to(model.device)
input_echo_len = len(inputs["input_ids"][0])
if input_echo_len >= model.config.seq_length:
print(f"Input length larger than {model.config.seq_length}")
eos_token_id = [
tokenizer.eos_token_id,
tokenizer.get_command("<|user|>"),
]
gen_kwargs = {
"max_new_tokens": max_new_tokens,
"do_sample": True if temperature > 1e-5 else False,
"top_p": top_p,
"repetition_penalty": repetition_penalty,
"logits_processor": [InvalidScoreLogitsProcessor()],
}
if temperature > 1e-5:
gen_kwargs["temperature"] = temperature
total_len = 0
for total_ids in model.stream_generate(**inputs, eos_token_id=eos_token_id, **gen_kwargs):
total_ids = total_ids.tolist()[0]
total_len = len(total_ids)
if echo:
output_ids = total_ids[:-1]
else:
output_ids = total_ids[input_echo_len:-1]
response = tokenizer.decode(output_ids)
if response and response[-1] != "�":
response, stop_found = apply_stopping_strings(response, ["<|observation|>"])
yield {
"text": response,
"usage": {
"prompt_tokens": input_echo_len,
"completion_tokens": total_len - input_echo_len,
"total_tokens": total_len,
},
"finish_reason": "function_call" if stop_found else None,
}
if stop_found:
break
# Only the last stream result contains finish_reason; set finish_reason to stop
ret = {
"text": response,
"usage": {
"prompt_tokens": input_echo_len,
"completion_tokens": total_len - input_echo_len,
"total_tokens": total_len,
},
"finish_reason": "stop",
}
yield ret
gc.collect()
torch.cuda.empty_cache()
def process_chatglm_messages(messages, tools=None):
_messages = messages
messages = []
if tools:
messages.append(
{
"role": "system",
"content": "Answer the following questions as best as you can. You have access to the following tools:",
"tools": tools
}
)
for m in _messages:
role, content, func_call = m.role, m.content, m.function_call
if role == "function":
messages.append(
{
"role": "observation",
"content": content
}
)
elif role == "assistant" and func_call is not None:
for response in content.split("<|assistant|>"):
metadata, sub_content = response.split("\n", maxsplit=1)
messages.append(
{
"role": role,
"metadata": metadata,
"content": sub_content.strip()
}
)
else:
messages.append({"role": role, "content": content})
return messages
def generate_chatglm3(model: PreTrainedModel, tokenizer: PreTrainedTokenizer, params: dict):
for response in generate_stream_chatglm3(model, tokenizer, params):
pass
return response
def apply_stopping_strings(reply, stop_strings) -> Tuple[str, bool]:
stop_found = False
for string in stop_strings:
idx = reply.find(string)
if idx != -1:
reply = reply[:idx]
stop_found = True
break
if not stop_found:
# If something like "\nYo" is generated just before "\nYou:" is completed, trim it
for string in stop_strings:
for j in range(len(string) - 1, 0, -1):
if reply[-j:] == string[:j]:
reply = reply[:-j]
break
else:
continue
break
return reply, stop_found
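# Illustrative behavior (not part of the original file):
#   apply_stopping_strings("Hi<|observation|>tail", ["<|observation|>"]) -> ("Hi", True)
#   apply_stopping_strings("Hi<|obs", ["<|observation|>"]) -> ("Hi", False)   # partial suffix trimmed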
# Deploying the ChatGLM3-6B Model with OpenVINO
[OpenVINO](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html) is an open-source toolkit designed by Intel for deep learning inference. It helps developers optimize models, improve inference performance, and reduce memory footprint. This example shows how to deploy ChatGLM3 with OpenVINO.
You need to clone this repository and then follow the steps below to convert the model into an OpenVINO IR model and run inference.
## 1. Environment setup
First, clone the OpenVINO ChatGLM3 inference repository and install its dependencies.
```bash
git clone https://github.com/OpenVINO-dev-contest/chatglm3.openvino.git
cd chatglm3.openvino
```
Next, we recommend creating a new virtual environment and installing the dependencies as follows.
```
python3 -m venv openvino_env
source openvino_env/bin/activate
python3 -m pip install --upgrade pip
pip install wheel setuptools
pip install -r requirements.txt
```
## 2. Convert the model
Since the Hugging Face model needs to be converted into an OpenVINO IR model, you need to download the model and convert it.
```
python3 convert.py --model_id THUDM/chatglm3-6b --output {your_path}/chatglm3-6b
```
### Optional parameters
* `--model_id` - path to the directory containing the model (absolute path).
* `--output` - path where the converted model is saved.
## 3. Quantize the model (optional)
```
python3 quantize.py --model_path {your_path}/chatglm3-6b --precision int4 --output {your_path}/chatglm3-6b-int4
```
### Optional parameters
* `--model_path` - path to the directory containing the OpenVINO IR model.
* `--precision` - quantization precision: int8 or int4.
* `--output` - path where the quantized model is saved.
## 4. Run the ChatGLM3 model
```
python3 chat.py --model_path {your_path}/chatglm3-6b --max_sequence_length 4096 --device CPU
```
### Optional parameters
* `--model_path` - path to the directory containing the OpenVINO IR model.
* `--max_sequence_length` - maximum number of output tokens.
* `--device` - the device to run inference on.
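If you prefer to load the converted IR model from your own script rather than chat.py, a minimal sketch based on the chat.py shipped with this demo (the model path is a placeholder) might look like this:

```python
# Minimal loading/generation sketch for a converted OpenVINO IR model; the path is a placeholder.
from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoConfig, AutoTokenizer

model_dir = "{your_path}/chatglm3-6b"  # directory produced by convert.py / quantize.py
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
ov_model = OVModelForCausalLM.from_pretrained(
    model_dir,
    device="CPU",
    ov_config={"PERFORMANCE_HINT": "LATENCY", "NUM_STREAMS": "1", "CACHE_DIR": ""},
    config=AutoConfig.from_pretrained(model_dir, trust_remote_code=True),
    trust_remote_code=True,
)

# Build ChatGLM3 chat inputs and generate a short reply
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "你好"}],
    add_generation_prompt=True, tokenize=True, return_tensors="pt",
)
output = ov_model.generate(inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```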
## Example
```
用户: 你好
ChatGLM3-6B-OpenVINO: 你好!有什么我可以帮助你的吗?
用户: 你是谁?
ChatGLM3-6B-OpenVINO: 我是一个名为ChatGLM3-6B的人工智能助手,是由清华大学KEG实验室和智谱AI 公司于2023 年共同训练的语言模型开发而成。我的任务是针对用户的问题和要求提供适当的答复和支持。
用户: 请给我讲一个有趣的故事
ChatGLM3-6B-OpenVINO: 从前,有一个名叫小明的小男孩,他是一个非常喜欢动物的人。有一天,他在森林里散步时,发现了一个非常漂亮的小鸟。小鸟受伤了,无法飞行。小明非常心疼,于是决定照顾这只小鸟。小明带着小鸟回家,为它搭建了一个小小的巢穴,并找来了一些软草和食物。每天,他都会给小鸟喂食,并为它换水。渐渐地,小鸟的伤势好了起来,开始在小明的家里飞来飞去,它们成了非常好的朋友。然而,一天,小明的父母告诉他,他们必须把小明养的小鸟送到森林里去。小明非常伤心,因为他已经和小鸟成为了好朋友。但是,他的父母告诉他,小鸟在森林里会更加自由自在,而且他也可以继续观看小鸟在森林中的生活。于是,小明和他的父母一起将小鸟送到了森林中。小鸟非常高兴,因为它又可以飞行了,并且还有许多其他的小动物朋友。小明也感到非常开心,因为他知道,即使不能一直拥有小鸟,他仍然可以欣赏到它们在自然中的美丽。从此以后,小明常常来到森林中,寻找小鸟。
用户: 请给这个故事起一个标题
ChatGLM3-6B-OpenVINO: 《友谊的力量:小明与小鸟的森林冒险》
```
## FAQ
1. Why does loading a local model still raise a Hugging Face connection error?
- Downgrade the transformers library to version 4.37.2
2. Do I need to install the OpenVINO C++ inference engine?
- No
3. Do I have to use Intel hardware?
- We have only tried this on Intel devices, and we recommend x86 Intel devices, including but not limited to:
- Intel CPUs, including consumer and server CPUs.
- Intel discrete GPUs, e.g. the Arc A770.
\ No newline at end of file
import argparse
from typing import List, Tuple
from threading import Thread
import torch
from optimum.intel.openvino import OVModelForCausalLM
from transformers import (AutoTokenizer, AutoConfig,
TextIteratorStreamer, StoppingCriteriaList, StoppingCriteria)
def parse_text(text):
lines = text.split("\n")
lines = [line for line in lines if line != ""]
count = 0
for i, line in enumerate(lines):
if "```" in line:
count += 1
items = line.split('`')
if count % 2 == 1:
lines[i] = f'<pre><code class="language-{items[-1]}">'
else:
lines[i] = f'<br></code></pre>'
else:
if i > 0:
if count % 2 == 1:
line = line.replace("`", "\`")
line = line.replace("<", "&lt;")
line = line.replace(">", "&gt;")
line = line.replace(" ", "&nbsp;")
line = line.replace("*", "&ast;")
line = line.replace("_", "&lowbar;")
line = line.replace("-", "&#45;")
line = line.replace(".", "&#46;")
line = line.replace("!", "&#33;")
line = line.replace("(", "&#40;")
line = line.replace(")", "&#41;")
line = line.replace("$", "&#36;")
lines[i] = "<br>" + line
text = "".join(lines)
return text
class StopOnTokens(StoppingCriteria):
def __init__(self, token_ids):
self.token_ids = token_ids
def __call__(
self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs
) -> bool:
for stop_id in self.token_ids:
if input_ids[0][-1] == stop_id:
return True
return False
if __name__ == "__main__":
parser = argparse.ArgumentParser(add_help=False)
parser.add_argument('-h',
'--help',
action='help',
help='Show this help message and exit.')
parser.add_argument('-m',
'--model_path',
required=True,
type=str,
help='Required. model path')
parser.add_argument('-l',
'--max_sequence_length',
default=256,
required=False,
type=int,
help='Optional. Maximum length of the output.')
parser.add_argument('-d',
'--device',
default='CPU',
required=False,
type=str,
help='Optional. Device used for inference.')
args = parser.parse_args()
model_dir = args.model_path
ov_config = {"PERFORMANCE_HINT": "LATENCY",
"NUM_STREAMS": "1", "CACHE_DIR": ""}
tokenizer = AutoTokenizer.from_pretrained(
model_dir, trust_remote_code=True)
print("====Compiling model====")
ov_model = OVModelForCausalLM.from_pretrained(
model_dir,
device=args.device,
ov_config=ov_config,
config=AutoConfig.from_pretrained(model_dir, trust_remote_code=True),
trust_remote_code=True,
)
streamer = TextIteratorStreamer(
tokenizer, timeout=60.0, skip_prompt=True, skip_special_tokens=True
)
stop_tokens = [0, 2]
stop_tokens = [StopOnTokens(stop_tokens)]
def convert_history_to_token(history: List[Tuple[str, str]]):
messages = []
for idx, (user_msg, model_msg) in enumerate(history):
if idx == len(history) - 1 and not model_msg:
messages.append({"role": "user", "content": user_msg})
break
if user_msg:
messages.append({"role": "user", "content": user_msg})
if model_msg:
messages.append({"role": "assistant", "content": model_msg})
model_inputs = tokenizer.apply_chat_template(messages,
add_generation_prompt=True,
tokenize=True,
return_tensors="pt")
return model_inputs
history = []
print("====Starting conversation====")
while True:
input_text = input("用户: ")
if input_text.lower() == 'stop':
break
if input_text.lower() == 'clear':
history = []
print("AI助手: 对话历史已清空")
continue
print("ChatGLM3-6B-OpenVINO:", end=" ")
history = history + [[parse_text(input_text), ""]]
model_inputs = convert_history_to_token(history)
generate_kwargs = dict(
input_ids=model_inputs,
max_new_tokens=args.max_sequence_length,
temperature=0.1,
do_sample=True,
top_p=1.0,
top_k=50,
repetition_penalty=1.1,
streamer=streamer,
stopping_criteria=StoppingCriteriaList(stop_tokens)
)
t1 = Thread(target=ov_model.generate, kwargs=generate_kwargs)
t1.start()
partial_text = ""
for new_text in streamer:
new_text = new_text
print(new_text, end="", flush=True)
partial_text += new_text
print("\n")
history[-1][1] = partial_text
\ No newline at end of file
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright 2024 ChatGLM team @ Zhipu AI
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
...@@ -132,6 +132,7 @@ data[:5]
<|assistant|>
该文件看起来包含有关某些条目的元数据,每个条目有以下字段:
- `file_name`: 文件名称
- `name`: 名称
- `type`: 类型(例如 "survivor" 或 "killer")
......
...@@ -29,7 +29,7 @@ Where `<|role|>` part is represented in a special token, which can’t be encod
### Example Scenarios
For better readability, an extra `\n` is added before each role special token. This extra `\n` should not be added in actual use and tokenizer implementation.
#### Multi-turn Dialogue
* There are only three roles: `<|user|>`, `<|assistant|>`, and `<|system|>`.
......
...@@ -33,44 +33,43 @@ ChatGLM3-6B同样采用Transformer模型结构:
### Docker (Method 1)
Running inside Docker is recommended; a pullable Docker image is provided:
```bash
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-centos7.6-dtk23.10.1-py310
```
Enter the container and install the dependencies that the image does not include:
```bash
docker run -dit --network=host --name=chatglm3 --privileged --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size=16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root --ulimit stack=-1:-1 --ulimit memlock=-1:-1 image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-centos7.6-dtk23.10.1-py310
docker exec -it chatglm3 /bin/bash
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
cd finetune_demo
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
```
### Conda (Method 2)
1. Create a conda virtual environment:
```bash
conda create -n chatglm python=3.10
```
2. 关于本项目DCU显卡所需的工具包、深度学习库等均可从[光合](https://developer.hpccube.com/tool/)开发者社区下载安装。
- [DTK 23.10.1](https://cancon.hpccube.com:65024/1/main/DTK-23.10.1)
- [Pytorch 2.1](https://cancon.hpccube.com:65024/4/main/pytorch/previous_release/dtk23.10)
- [Deepspeed 0.12.3](https://cancon.hpccube.com:65024/4/main/deepspeed/previous_release/dtk23.10)
Tips:以上dtk驱动、python、deepspeed等工具版本需要严格一一对应。
3. 其它依赖库参照requirements.txt安装:
```bash
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
cd finetune_demo
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
```
### 注意
```
#到虚拟环境下对应的python/site-packages注释掉一些版本判断
site-packages/accelerate/accelerator.py 文件
```
...@@ -89,7 +88,7 @@ site-packages/transformers/utils/versions.py 文件
## 数据集
单轮对话数据以[ADGEN](https://aclanthology.org/D19-1321.pdf) (广告生成) 数据集为例介绍代码的使用方法,该数据集任务为根据输入(content)生成一段广告词(summary),以下为下载地址:
- [Google Drive](https://drive.google.com/file/d/13_vf0xRTQsyneRKdD1bZIr93vBGOczrk/view?usp=sharing) 或者 [Tsinghua Cloud](https://cloud.tsinghua.edu.cn/f/b3f119a008264b1cabd1/?dl=1)
下载处理好的 ADGEN 数据集,将解压后的AdvertiseGen目录放到 [finetune_demo/data](./finetune_demo/data)目录下。数据集目录结构如下:
```
── AdvertiseGen
│   ├── dev.json
...@@ -97,18 +96,10 @@ site-packages/transformers/utils/versions.py 文件
```
通过以下方式将数据集处理成模型需要的格式:
```bash
cd finetune_demo
python process.py
```
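下面给出一个示意性的转换脚本,演示如何把 ADGEN 的 content/summary 样本整理成单轮对话形式。其中输出字段 conversations/role/content 仅为假设的示意格式,路径也仅为示例,实际格式与路径请以 finetune_demo 中 process.py 的实现为准:

```python
import json

# 示意脚本:将 ADGEN 的 content/summary 样本转换为单轮对话格式
# (输出字段名与路径仅为示意,实际请以 finetune_demo/process.py 为准)
with open("data/AdvertiseGen/train.json", encoding="utf-8") as fin, \
        open("data/train.json", "w", encoding="utf-8") as fout:
    for line in fin:
        sample = json.loads(line)
        item = {
            "conversations": [
                {"role": "user", "content": sample["content"]},
                {"role": "assistant", "content": sample["summary"]},
            ]
        }
        fout.write(json.dumps(item, ensure_ascii=False) + "\n")
```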
### 模型下载
| Model | Seq Length | Download
| :---: |:---------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------:
...@@ -118,69 +109,74 @@ python ./scripts/format_tool_alpaca.py --path "train_data.json"
## 训练
### SFT微调
#### 单轮对话微调
```bash
cd ./finetune_demo
bash sft.sh
```
注意:请根据自己的需求配置其中的模型路径、数据集路径;batchsize、学习率等参数在 ./finetune_demo/configs/sft.yaml 中配置。
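可以用如下示意代码快速查看配置中的参数(假设已安装 pyyaml,字段名以实际配置文件为准):

```python
import yaml  # 需要已安装 pyyaml

# 快速查看 sft.yaml 中的训练参数(字段名以实际配置文件为准)
with open("./finetune_demo/configs/sft.yaml", encoding="utf-8") as f:
    config = yaml.safe_load(f)
print(config)
```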
#### 推理验证
对于输入输出格式的微调,可使用 `sft_inf.sh` 进行基本的推理验证。
在完成微调任务之后,可以看到 `output` 文件夹下多出了很多 `checkpoint-*` 文件夹,这些文件夹代表训练的轮数。我们选择最后一轮的微调权重,并在推理时导入。
注意:此时要将从 Hugging Face 下载的原生 `tokenizer_config.json` 和 `tokenization_chatglm.py` 两个文件放入待测的 checkpoint 文件夹下,比如 ./finetune_demo/output/checkpoint-3000/
```bash
cd ./finetune_demo
bash sft_inf.sh
```
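除了运行 `sft_inf.sh`,也可以用如下示意代码手动加载微调后的 checkpoint 做快速验证(路径与输入仅为示例,且假设该目录中已按上文放入 tokenizer 相关文件):

```python
from transformers import AutoTokenizer, AutoModel

# 示例路径,请替换为实际的 checkpoint 目录(需已包含 tokenizer 相关文件)
ckpt = "./finetune_demo/output/checkpoint-3000"
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)
model = AutoModel.from_pretrained(ckpt, trust_remote_code=True, device='cuda').eval()
# 用一条广告生成样例做快速验证(输入内容仅为示例)
response, _ = model.chat(tokenizer, "类型#裙*风格#简约", history=[])
print(response)
```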
### LORA微调
#### 单轮对话微调
```bash
cd ./finetune_demo
bash lora.sh
```
注意:请根据自己的需求配置其中的模型路径、数据集路径;batchsize、学习率等参数在 ./finetune_demo/configs/lora.yaml 中配置。
#### 推理验证
在完成微调任务之后,可以看到 `output` 文件夹下多出了很多 `checkpoint-*` 文件夹,这些文件夹代表训练的轮数。我们选择最后一轮的微调权重,并在推理时导入。
注意:经过LORA微调训练后的checkpoint无需复制原生GLM3的tokenizer文件到其目录下。
```bash
cd ./finetune_demo
bash lora_inf.sh
```
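如果希望在 Python 中直接加载 LoRA 微调结果,也可以参考如下示意代码(假设 LoRA 权重以 peft 格式保存,路径仅为示例):

```python
from transformers import AutoTokenizer
from peft import AutoPeftModelForCausalLM

# 示例路径,请替换为实际的 LoRA checkpoint 目录
ckpt = "./finetune_demo/output/checkpoint-3000"
# LoRA checkpoint 中记录了基座模型位置,AutoPeftModelForCausalLM 会加载基座权重并挂载 LoRA 权重
model = AutoPeftModelForCausalLM.from_pretrained(ckpt, trust_remote_code=True, device_map="auto").eval()
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
response, _ = model.chat(tokenizer, "类型#裙*风格#简约", history=[])
print(response)
```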
## Result
### SFT微调
#### 单轮对话微调推理结果
<div align="center">
<img src="./media/result1.jpg">
</div>
### LORA微调
#### 单轮对话微调推理结果
<div align="center">
<img src="./media/result2.jpg">
</div>
### 精度
......
...@@ -10,15 +10,36 @@
📍Experience the larger-scale ChatGLM model at <a href="https://www.chatglm.cn">chatglm.cn</a>
</p>
📔 About `ChatGLM3-6B`
For more detailed usage information, please refer to:
+ [ChatGLM3 technical documentation](https://lslfd0slxc.feishu.cn/wiki/WvQbwIJ9tiPAxGk8ywDck6yfnof?from=from_copylink)
+ [Bilibili video](https://www.bilibili.com/video/BV1uC4y1J7yA)
+ [YouTube video](https://www.youtube.com/watch?v=Pw9PB6R7ORA)
## GLM-4 Introduction
We have released the latest **GLM-4** model, which achieves new breakthroughs on multiple benchmarks. You can experience our latest model through the following channels.
+ [ChatGLM Qingyan](https://www.chatglm.cn): To experience the latest version of GLM-4, including **GLM, All Tools** and other functions, download the Zhipu Qingyan APP or use the [web page](https://www.chatglm.cn).
+ [API Platform](https://open.bigmodel.cn/): The new-generation API platform has been launched, where you can directly access the API and experience new models such as `GLM-4`, `GLM-3-Turbo`, `CharacterGLM-3`, and `CogView-3`. Among them, `GLM-4` and `GLM-3-Turbo` support new functions such as `system prompt`, `function call`, `retrieval`, and `Web_Search`. Welcome to try them out.
+ [GLM4 API Open Source Tutorial](https://github.com/MetaGLM/glm-cookbook/) - A tutorial and basic application guide for the GLM-4 API. You are invited to explore and experiment.
For API-related inquiries, refer to this open-source tutorial, or utilize the [GLM-4 API AI Assistant](https://open.bigmodel.cn/shareapp/v1/?share_code=sQwt5qyqYVaNh1O_87p8O) for assistance with common questions.
-----
## ChatGLM3 Introduction
**ChatGLM3** is a generation of pre-trained dialogue models jointly released by Zhipu AI and Tsinghua KEG. ChatGLM3-6B is the open-source model in the ChatGLM3 series, maintaining many excellent features of the first two generations such as smooth dialogue and low deployment threshold, while introducing the following features:
1. **Stronger Base Model:** The base model of ChatGLM3-6B, ChatGLM3-6B-Base, adopts a more diverse training dataset, more sufficient training steps, and a more reasonable training strategy. Evaluations on datasets from various perspectives such as semantics, mathematics, reasoning, code, and knowledge show that **ChatGLM3-6B-Base has the strongest performance among base models below 10B**.
2. **More Complete Function Support:** ChatGLM3-6B adopts a newly designed [Prompt format](PROMPT_en.md), supporting multi-turn dialogues as usual. It also natively supports [tool invocation](tools_using_demo/README_en.md) (Function Call), code execution (Code Interpreter), and Agent tasks in complex scenarios.
3. **More Comprehensive Open-source Series:** In addition to the dialogue model [ChatGLM3-6B](https://huggingface.co/THUDM/chatglm3-6b), the basic model [ChatGLM3-6B-Base](https://huggingface.co/THUDM/chatglm3-6b-base), the long-text dialogue model [ChatGLM3-6B-32K](https://huggingface.co/THUDM/chatglm3-6b-32k), and [ChatGLM3-6B-128K](https://huggingface.co/THUDM/chatglm3-6b-128k), which further strengthens long-text understanding, have also been open-sourced. All these weights are **fully open** for academic research, and **free commercial use is also allowed** after registration via a [questionnaire](https://open.bigmodel.cn/mla/form).
-----
...@@ -28,17 +49,31 @@ Although every effort has been made to ensure the compliance and accuracy of the
## Model List
| Model | Seq Length | Download
|:----------------:|:----------:|:-----------------------------------------------------------------------------------------------------------------------------------:
| ChatGLM3-6B | 8k | [HuggingFace](https://huggingface.co/THUDM/chatglm3-6b) \| [ModelScope](https://modelscope.cn/models/ZhipuAI/chatglm3-6b)
| ChatGLM3-6B-Base | 8k | [HuggingFace](https://huggingface.co/THUDM/chatglm3-6b-base) \| [ModelScope](https://modelscope.cn/models/ZhipuAI/chatglm3-6b-base)
| ChatGLM3-6B-32K | 32k | [HuggingFace](https://huggingface.co/THUDM/chatglm3-6b-32k) \| [ModelScope](https://modelscope.cn/models/ZhipuAI/chatglm3-6b-32k)
| ChatGLM3-6B-128K | 128k | [HuggingFace](https://huggingface.co/THUDM/chatglm3-6b-128k) \| [ModelScope](https://modelscope.cn/models/ZhipuAI/chatglm3-6b-128k)
## Projects
The following open-source repositories provide in-depth support for the ChatGLM3-6B model and are worth exploring:
Inference acceleration:
* [chatglm.cpp](https://github.com/li-plus/chatglm.cpp): Real-time inference on your laptop accelerated by quantization, similar to llama.cpp.
* [ChatGLM3-TPU](https://github.com/sophgo/ChatGLM3-TPU): Using the TPU accelerated inference solution, it runs about 7.5 token/s in real time on the end-side chip BM1684X (16T@FP16, 16G DDR).
* [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main): A high-performance GPU-accelerated inference solution developed by NVIDIA; you can refer to these [steps](./tensorrt_llm_demo/README.md) to deploy ChatGLM3.
* [OpenVINO](https://github.com/openvinotoolkit): A high-performance CPU- and GPU-accelerated inference solution developed by Intel; you can refer to this [step](./Intel_device_demo/openvino_demo/README.md) to deploy the ChatGLM3-6B model.
Efficient fine-tuning:
* [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory): An excellent, easy-to-use and efficient fine-tuning framework.
Application framework:
* [LangChain-Chatchat](https://github.com/chatchat-space/Langchain-Chatchat): An open-source, offline-deployable retrieval-augmented generation (RAG) knowledge-base project based on large language models such as ChatGLM and application frameworks such as LangChain.
* [BISHENG](https://github.com/dataelement/bisheng): An open-source platform for developing LLM applications. It empowers and accelerates the development of LLM applications and helps users enter the next generation of application development with the best experience.
## Evaluation Results
### Typical Tasks
...@@ -75,10 +110,7 @@ Then use pip to install the dependencies:
```
pip install -r requirements.txt
```
+ In order to ensure that the version of `torch` is correct, please strictly follow the instructions of the [official documentation](https://pytorch.org/get-started/locally/) for installation.
### Integrated Demo
...@@ -128,21 +160,21 @@ git clone https://huggingface.co/THUDM/chatglm3-6b
If the download from HuggingFace is slow, you can also download it from [ModelScope](https://modelscope.cn/models/ZhipuAI/chatglm3-6b).
# Model Fine-tuning
We provide a basic fine-tuning framework for ChatGLM3-6B. You can use it to fine-tune the model on your own dataset. For more details, please refer to the [Fine-tuning Demo](finetune_demo/README_en.md).
### Web-based Dialogue Demo
![web-demo](resources/web-demo.gif)
You can launch a web-based demo using Gradio with the following command:
```shell
python web_demo_gradio.py
```
![web-demo](resources/web-demo2.png)
You can launch a web-based demo using Streamlit with the following command:
```shell
streamlit run web_demo_streamlit.py
```
The web-based demo will run a Web Server and output an address. You can use it by opening the output address in a browser. Based on tests, the web-based demo using Streamlit runs more smoothly.
...@@ -159,19 +191,34 @@ python cli_demo.py
The program runs interactively in the command line: enter an instruction and press Enter to generate a response. Enter `clear` to clear the dialogue history, and `stop` to terminate the program.
### OpenAI API / Zhipu API Demo
We have launched open-source model API deployment code in OpenAI / ZhipuAI format, which can be used as the backend of any ChatGPT-based application.
Currently, you can deploy it by running [api_server.py](openai_api_demo/api_server.py) in the repository:
```shell
cd openai_api_demo
python api_server.py
```
We also provide sample code to test the performance of API calls:
+ OpenAI test script: [openai_api_request.py](openai_api_demo/openai_api_request.py)
+ ZhipuAI test script: [zhipu_api_request.py](openai_api_demo/zhipu_api_request.py)
+ Test with Curl
+ chat Curl test
```shell
curl -X POST "http://127.0.0.1:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
-d "{\"model\": \"chatglm3-6b\", \"messages\": [{\"role\": \"system\", \"content\": \"You are ChatGLM3, a large language model trained by Zhipu.AI. Follow the user's instructions carefully. Respond using markdown.\"}, {\"role\": \"user\", \"content\": \"你好,给我讲一个故事,大概100字\"}], \"stream\": false, \"max_tokens\": 100, \"temperature\": 0.8, \"top_p\": 0.8}"
```
+ agent-chat Curl test
```shell
curl -X POST "http://127.0.0.1:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
-d "{\"model\": \"chatglm3-6b\", \"agent\": true, \"messages\": [{\"role\": \"user\", \"content\": \"37乘以8加7除2等于多少?\"}], \"stream\": true, \"max_tokens\": 100, \"temperature\": 0.8, \"top_p\": 0.8}"
```
+ Testing with Python
```shell
cd openai_api_demo
python openai_api_request.py
```
...@@ -181,7 +228,7 @@ If the test is successful, the model should return a story.
### Tool Invocation
For methods of tool invocation, please refer to [Tool Invocation](tools_using_demo/README_en.md).
## Low-Cost Deployment
...@@ -217,15 +264,18 @@ Loading the half-precision ChatGLM3-6B model requires about 13GB of memory. Machines
### Multi-GPU Deployment
If you have multiple GPUs, but each GPU's VRAM size is not enough to accommodate the complete model, then the model can be split across multiple GPUs. First, install accelerate: `pip install accelerate`, and then load the model as usual.
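For example, a minimal sketch (assuming `accelerate` is installed) that lets `transformers` shard the weights across the visible GPUs automatically:

```python
from transformers import AutoTokenizer, AutoModel

# With accelerate installed, device_map="auto" splits the weights across all visible GPUs.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True, device_map="auto").eval()
```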
### OpenVINO Demo
ChatGLM3-6B already supports accelerated inference with the OpenVINO toolkit, which brings a larger inference speed-up on Intel CPU and GPU devices. For specific usage, please refer to the [OpenVINO Demo](Intel_device_demo/openvino_demo/README.md).
### TensorRT-LLM Demo
ChatGLM3-6B now supports accelerated inference using the TensorRT-LLM toolkit, significantly improving model inference speed. For specific usage, please refer to the [TensorRT-LLM Demo](tensorrt_llm_demo/tensorrt_llm_cli_demo.py) and the official technical documentation.
## Citation
......
# ChatGLM3
<p align="center">
🤗 <a href="https://huggingface.co/THUDM/chatglm3-6b" target="_blank">HF Repo</a> • 🤖 <a href="https://modelscope.cn/models/ZhipuAI/chatglm3-6b" target="_blank">ModelScope</a> • 🐦 <a href="https://twitter.com/thukeg" target="_blank">Twitter</a> • 📃 <a href="https://arxiv.org/abs/2103.10360" target="_blank">[GLM@ACL 22]</a> <a href="https://github.com/THUDM/GLM" target="_blank">[GitHub]</a> • 📃 <a href="https://arxiv.org/abs/2210.02414" target="_blank">[GLM-130B@ICLR 23]</a> <a href="https://github.com/THUDM/GLM-130B" target="_blank">[GitHub]</a> <br>
</p>
<p align="center">
👋 加入我们的 <a href="https://join.slack.com/t/chatglm/shared_invite/zt-25ti5uohv-A_hs~am_D3Q8XPZMpj7wwQ" target="_blank">Slack</a><a href="resources/WECHAT.md" target="_blank">微信</a>
</p>
<p align="center">
📍在 <a href="https://www.chatglm.cn">chatglm.cn</a> 体验更大规模的 ChatGLM 模型。
</p>
[Read this in English.](./README_en.md)
📔 更为详细的使用信息,可以参考:[ChatGLM3技术文档](https://lslfd0slxc.feishu.cn/wiki/WvQbwIJ9tiPAxGk8ywDck6yfnof?from=from_copylink)
## 介绍
ChatGLM3 是智谱AI和清华大学 KEG 实验室联合发布的新一代对话预训练模型。ChatGLM3-6B 是 ChatGLM3 系列中的开源模型,在保留了前两代模型对话流畅、部署门槛低等众多优秀特性的基础上,ChatGLM3-6B 引入了如下特性:
1. **更强大的基础模型:** ChatGLM3-6B 的基础模型 ChatGLM3-6B-Base 采用了更多样的训练数据、更充分的训练步数和更合理的训练策略。在语义、数学、推理、代码、知识等不同角度的数据集上测评显示,**ChatGLM3-6B-Base 具有在 10B 以下的基础模型中最强的性能**
2. **更完整的功能支持:** ChatGLM3-6B 采用了全新设计的 [Prompt 格式](PROMPT.md),除正常的多轮对话外,还原生支持[工具调用](tool_using/README.md)(Function Call)、代码执行(Code Interpreter)和 Agent 任务等复杂场景。
3. **更全面的开源序列:** 除了对话模型 [ChatGLM3-6B](https://huggingface.co/THUDM/chatglm3-6b) 外,还开源了基础模型 [ChatGLM3-6B-Base](https://huggingface.co/THUDM/chatglm3-6b-base)、长文本对话模型 [ChatGLM3-6B-32K](https://huggingface.co/THUDM/chatglm3-6b-32k)。以上所有权重对学术研究**完全开放**,在填写[问卷](https://open.bigmodel.cn/mla/form)进行登记后**亦允许免费商业使用**
-----
ChatGLM3 开源模型旨在与开源社区一起推动大模型技术发展,恳请开发者和大家遵守[开源协议](MODEL_LICENSE),勿将开源模型和代码及基于开源项目产生的衍生物用于任何可能给国家和社会带来危害的用途以及用于任何未经过安全评估和备案的服务。目前,本项目团队未基于 **ChatGLM3 开源模型**开发任何应用,包括网页端、安卓、苹果 iOS 及 Windows App 等应用。
尽管模型在训练的各个阶段都尽力确保数据的合规性和准确性,但由于 ChatGLM3-6B 模型规模较小,且模型受概率随机性因素影响,无法保证输出内容的准确。同时模型的输出容易被用户的输入误导。**本项目不承担开源模型和代码导致的数据安全、舆情风险或发生任何模型被误导、滥用、传播、不当利用而产生的风险和责任。**
## 模型列表
| Model | Seq Length | Download
| :---: |:---------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------:
| ChatGLM3-6B | 8k | [HuggingFace](https://huggingface.co/THUDM/chatglm3-6b) \| [ModelScope](https://modelscope.cn/models/ZhipuAI/chatglm3-6b)
| ChatGLM3-6B-Base | 8k | [HuggingFace](https://huggingface.co/THUDM/chatglm3-6b-base) \| [ModelScope](https://modelscope.cn/models/ZhipuAI/chatglm3-6b-base)
| ChatGLM3-6B-32K | 32k | [HuggingFace](https://huggingface.co/THUDM/chatglm3-6b-32k) \| [ModelScope](https://modelscope.cn/models/ZhipuAI/chatglm3-6b-32k)
## 友情链接
对 ChatGLM3 进行加速的开源项目:
* [chatglm.cpp](https://github.com/li-plus/chatglm.cpp): 类似 llama.cpp 的量化加速推理方案,实现笔记本上实时对话
* [ChatGLM3-TPU](https://github.com/sophgo/ChatGLM3-TPU): 采用TPU加速推理方案,在算能端侧芯片BM1684X(16T@FP16,内存16G)上实时运行约7.5 token/s
## 评测结果
### 典型任务
我们选取了 8 个中英文典型数据集,在 ChatGLM3-6B (base) 版本上进行了性能测试。
| Model | GSM8K | MATH | BBH | MMLU | C-Eval | CMMLU | MBPP | AGIEval |
|------------------|:-----:|:----:|:----:|:----:|:------:|:-----:|:----:|:-------:|
| ChatGLM2-6B-Base | 32.4 | 6.5 | 33.7 | 47.9 | 51.7 | 50.0 | - | - |
| Best Baseline    | 52.1  | 13.1 | 45.0 | 60.1 | 63.5   | 62.2  | 47.5 | 45.8    |
| ChatGLM3-6B-Base | 72.3 | 25.7 | 66.1 | 61.4 | 69.0 | 67.5 | 52.4 | 53.7 |
> Best Baseline 指的是截止 2023年10月27日、模型参数在 10B 以下、在对应数据集上表现最好的预训练模型,不包括只针对某一项任务训练而未保持通用能力的模型。
> 对 ChatGLM3-6B-Base 的测试中,BBH 采用 3-shot 测试,需要推理的 GSM8K、MATH 采用 0-shot CoT 测试,MBPP 采用 0-shot 生成后运行测例计算 Pass@1 ,其他选择题类型数据集均采用 0-shot 测试。
我们在多个长文本应用场景下对 ChatGLM3-6B-32K 进行了人工评估测试。与二代模型相比,其效果平均提升了超过 50%。在论文阅读、文档摘要和财报分析等应用中,这种提升尤为显著。此外,我们还在 LongBench 评测集上对模型进行了测试,具体结果如下表所示
| Model | 平均 | Summary | Single-Doc QA | Multi-Doc QA | Code | Few-shot | Synthetic |
|----------------------|:-----:|:----:|:----:|:----:|:------:|:-----:|:-----:|
| ChatGLM2-6B-32K | 41.5 | 24.8 | 37.6 | 34.7 | 52.8 | 51.3 | 47.7 |
| ChatGLM3-6B-32K | 50.2 | 26.6 | 45.8 | 46.1 | 56.2 | 61.2 | 65 |
## 使用方式
### 环境安装
首先需要下载本仓库:
```shell
git clone https://github.com/THUDM/ChatGLM3
cd ChatGLM3
```
然后使用 pip 安装依赖:
```
pip install -r requirements.txt
```
+ `transformers` 库版本应为 `4.30.2` 及以上,`torch` 库版本应为 2.0 及以上,以获得最佳的推理性能。
+ 为了保证 `torch` 的版本正确,请严格按照 [官方文档](https://pytorch.org/get-started/locally/) 的说明安装。
+ `gradio` 库版本应该为 `3.x` 的版本。
### 综合 Demo
我们提供了一个集成以下三种功能的综合 Demo,运行方法请参考 [综合 Demo](composite_demo/README.md)
- Chat: 对话模式,在此模式下可以与模型进行对话。
- Tool: 工具模式,模型除了对话外,还可以通过工具进行其他操作。
<img src="resources/tool.png" width="400">
- Code Interpreter: 代码解释器模式,模型可以在一个 Jupyter 环境中执行代码并获取结果,以完成复杂任务。
<img src="resources/heart.png" width="400">
### 代码调用
可以通过如下代码调用 ChatGLM 模型来生成对话:
```python
>>> from transformers import AutoTokenizer, AutoModel
>>> tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
>>> model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True, device='cuda')
>>> model = model.eval()
>>> response, history = model.chat(tokenizer, "你好", history=[])
>>> print(response)
你好👋!我是人工智能助手 ChatGLM3-6B,很高兴见到你,欢迎问我任何问题
>>> response, history = model.chat(tokenizer, "晚上睡不着应该怎么办", history=history)
>>> print(response)
晚上睡不着可能会让你感到焦虑或不舒服,但以下是一些可以帮助你入睡的方法:
1. 制定规律的睡眠时间表:保持规律的睡眠时间表可以帮助你建立健康的睡眠习惯,使你更容易入睡尽量在每天的相同时间上床,并在同一时间起床
2. 创造一个舒适的睡眠环境:确保睡眠环境舒适,安静,黑暗且温度适宜可以使用舒适的床上用品,并保持房间通风
3. 放松身心:在睡前做些放松的活动,例如泡个热水澡,听些轻柔的音乐,阅读一些有趣的书籍等,有助于缓解紧张和焦虑,使你更容易入睡
4. 避免饮用含有咖啡因的饮料:咖啡因是一种刺激性物质,会影响你的睡眠质量尽量避免在睡前饮用含有咖啡因的饮料,例如咖啡,茶和可乐
5. 避免在床上做与睡眠无关的事情:在床上做些与睡眠无关的事情,例如看电影,玩游戏或工作等,可能会干扰你的睡眠
6. 尝试呼吸技巧:深呼吸是一种放松技巧,可以帮助你缓解紧张和焦虑,使你更容易入睡试着慢慢吸气,保持几秒钟,然后缓慢呼气
如果这些方法无法帮助你入睡,你可以考虑咨询医生或睡眠专家,寻求进一步的建议
```
#### 从本地加载模型
以上代码会由 `transformers` 自动下载模型实现和参数。完整的模型实现在 [Hugging Face Hub](https://huggingface.co/THUDM/chatglm3-6b)。如果你的网络环境较差,下载模型参数可能会花费较长时间甚至失败。此时可以先将模型下载到本地,然后从本地加载。
从 Hugging Face Hub 下载模型需要先[安装Git LFS](https://docs.github.com/zh/repositories/working-with-files/managing-large-files/installing-git-large-file-storage),然后运行
```Shell
git clone https://huggingface.co/THUDM/chatglm3-6b
```
如果你从 HuggingFace 下载比较慢,也可以从 [ModelScope](https://modelscope.cn/models/ZhipuAI/chatglm3-6b) 中下载。
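下载完成后,将模型加载路径替换为本地目录即可,例如(其中 ./chatglm3-6b 为示例路径,请替换为实际下载位置):

```python
from transformers import AutoTokenizer, AutoModel

# 将路径替换为模型实际下载到的本地目录
local_path = "./chatglm3-6b"
tokenizer = AutoTokenizer.from_pretrained(local_path, trust_remote_code=True)
model = AutoModel.from_pretrained(local_path, trust_remote_code=True, device='cuda').eval()
```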
### 模型微调
请参考对话模型微调 [ChatGLM3-6B 微调示例](finetune_chatmodel_demo/README.md),或基座模型微调 [ChatGLM3-6B-base 微调示例](finetune_basemodel_demo/README.md)
请注意,不同的微调脚本对应的模型并不相同,请根据需要选择对应的模型。
### 网页版对话 Demo
![web-demo](resources/web-demo.gif)
可以通过以下命令启动基于 Gradio 的网页版 demo:
```shell
python web_demo.py
```
![web-demo](resources/web-demo2.png)
可以通过以下命令启动基于 Streamlit 的网页版 demo:
```shell
streamlit run web_demo2.py
```
网页版 demo 会运行一个 Web Server,并输出地址。在浏览器中打开输出的地址即可使用。 经测试,基于 Streamlit 的网页版 Demo 会更流畅。
### 命令行对话 Demo
![cli-demo](resources/cli-demo.png)
运行仓库中 [cli_demo.py](basic_demo/cli_demo.py)
```shell
python cli_demo.py
```
程序会在命令行中进行交互式的对话,在命令行中输入指示并回车即可生成回复,输入 `clear` 可以清空对话历史,输入 `stop` 终止程序。
### LangChain Demo
请参考 [基于 LangChain 的工具调用 Demo](langchain_demo/README.md)
### 工具调用
关于工具调用的方法请参考 [工具调用](tool_using/README.md)
### API 部署
感谢 [@xusenlinzy](https://github.com/xusenlinzy) 实现了 OpenAI 格式的流式 API 部署,可以作为任意基于 ChatGPT 的应用的后端,比如 [ChatGPT-Next-Web](https://github.com/Yidadaa/ChatGPT-Next-Web)。可以通过运行仓库中的[openai_api.py](openai_api_demo/openai_api.py) 进行部署:
```shell
cd openai_api_demo
python openai_api.py
```
同时,我们也编写了示例代码,用来测试 API 调用的性能。可以通过运行仓库中的 [openai_api_request.py](openai_api_demo/openai_api_request.py) 进行测试:
+ 使用Curl进行测试
```shell
curl -X POST "http://127.0.0.1:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
-d "{\"model\": \"chatglm3-6b\", \"messages\": [{\"role\": \"system\", \"content\": \"You are ChatGLM3, a large language model trained by Zhipu.AI. Follow the user's instructions carefully. Respond using markdown.\"}, {\"role\": \"user\", \"content\": \"你好,给我讲一个故事,大概100字\"}], \"stream\": false, \"max_tokens\": 100, \"temperature\": 0.8, \"top_p\": 0.8}"
````
+ 使用Python进行测试
```shell
cd openai_api_demo
python openai_api_request.py
```
如果测试成功,则模型应该返回一段故事。
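如果希望在自己的程序中调用该接口,也可以参考如下示意代码(假设已安装 1.x 版本的 openai Python 包,且服务运行在本地 8000 端口,地址与模型名请按实际部署修改):

```python
from openai import OpenAI

# base_url 指向本地部署的 OpenAI 格式接口;本地部署时 api_key 通常可以任意填写
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="chatglm3-6b",
    messages=[{"role": "user", "content": "你好,给我讲一个故事,大概100字"}],
    max_tokens=100,
    temperature=0.8,
)
print(response.choices[0].message.content)
```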
## 低成本部署
### 模型量化
默认情况下,模型以 FP16 精度加载,运行上述代码需要大概 13GB 显存。如果你的 GPU 显存有限,可以尝试以量化方式加载模型,使用方法如下:
```python
model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).quantize(4).cuda()
```
模型量化会带来一定的性能损失,经过测试,ChatGLM3-6B 在 4-bit 量化下仍然能够进行自然流畅的生成。
### CPU 部署
如果你没有 GPU 硬件的话,也可以在 CPU 上进行推理,但是推理速度会更慢。使用方法如下(需要大概 32GB 内存)
```python
model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).float()
```
### Mac 部署
对于搭载了 Apple Silicon 或者 AMD GPU 的 Mac,可以使用 MPS 后端来在 GPU 上运行 ChatGLM3-6B。需要参考 Apple 的 [官方说明](https://developer.apple.com/metal/pytorch) 安装 PyTorch-Nightly(正确的版本号应该是2.x.x.dev2023xxxx,而不是 2.x.x)。
目前在 MacOS 上只支持[从本地加载模型](README.md#从本地加载模型)。将代码中的模型加载改为从本地加载,并使用 mps 后端:
```python
model = AutoModel.from_pretrained("your local path", trust_remote_code=True).to('mps')
```
加载半精度的 ChatGLM3-6B 模型需要大概 13GB 内存。内存较小的机器(比如 16GB 内存的 MacBook Pro),在空余内存不足的情况下会使用硬盘上的虚拟内存,导致推理速度严重变慢。
### 多卡部署
如果你有多张 GPU,但是每张 GPU 的显存大小都不足以容纳完整的模型,那么可以将模型切分在多张GPU上。首先安装 accelerate: `pip install accelerate`,然后通过如下方法加载模型:
```python
from utils import load_model_on_gpus
model = load_model_on_gpus("THUDM/chatglm3-6b", num_gpus=2)
```
即可将模型部署到两张 GPU 上进行推理。你可以将 `num_gpus` 改为你希望使用的 GPU 数。默认是均匀切分的,你也可以传入 `device_map` 参数来自己指定。
## 引用
如果你觉得我们的工作有帮助的话,请考虑引用下列论文。
```
@article{zeng2022glm,
title={Glm-130b: An open bilingual pre-trained model},
author={Zeng, Aohan and Liu, Xiao and Du, Zhengxiao and Wang, Zihan and Lai, Hanyu and Ding, Ming and Yang, Zhuoyi and Xu, Yifan and Zheng, Wendi and Xia, Xiao and others},
journal={arXiv preprint arXiv:2210.02414},
year={2022}
}
```
```
@inproceedings{du2022glm,
title={GLM: General Language Model Pretraining with Autoregressive Blank Infilling},
author={Du, Zhengxiao and Qian, Yujie and Liu, Xiao and Ding, Ming and Qiu, Jiezhong and Yang, Zhilin and Tang, Jie},
booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
pages={320--335},
year={2022}
}
```