Commit 467ec853 authored by lvzhen

Merge branch 'master' into 'master'

ChatGLM3-6B 微调训练

See merge request !2
parents 971c0aee 0006ad16
name: "\U0001F41B Bug Report"
description: Submit a bug report to help us improve ChatGLM3 / 提交一个 Bug 问题报告来帮助我们改进 ChatGLM3
body:
- type: textarea
id: system-info
attributes:
label: System Info / 系統信息
description: Your operating environment / 您的运行环境信息
placeholder: Includes Cuda version, Transformers version, Python version, operating system, hardware information (if you suspect a hardware problem)... / 包括Cuda版本,Transformers版本,Python版本,操作系统,硬件信息(如果您怀疑是硬件方面的问题)...
validations:
required: true
- type: textarea
id: who-can-help
attributes:
label: Who can help? / 谁可以帮助到您?
description: |
Your issue will be replied to more quickly if you can figure out the right person to tag with @
All issues are read by one of the maintainers, so if you don't know who to tag, just leave this blank and our maintainer will ping the right person.
Please tag fewer than 3 people.
如果您能找到合适的标签 @,您的问题会更快得到回复。
所有问题都会由我们的维护者阅读,如果您不知道该标记谁,只需留空,我们的维护人员会找到合适的开发组成员来解决问题。
标记的人数应该不超过 3 个人。
Related demo leader / 相关demo负责人 :
- finetune_demo: @Btlmd
- langchain_demo: @yincf
- composite_demo: @abmfy
If the bug is not in one of these three subsections, you may leave the helper unspecified; our maintainer will find the right person in the development group to solve the problem.
如果不是这三个子版块的bug,您可以不指明帮助者,我们的维护人员会找到合适的开发组成员来解决问题。
placeholder: "@Username ..."
- type: checkboxes
id: information-scripts-examples
attributes:
label: Information / 问题信息
description: 'The problem arises when using: / 问题出现在'
options:
- label: "The official example scripts / 官方的示例脚本"
- label: "My own modified scripts / 我自己修改的脚本和任务"
- type: textarea
id: reproduction
validations:
required: true
attributes:
label: Reproduction / 复现过程
description: |
Please provide a code example that reproduces the problem you encountered, preferably with a minimal reproduction unit.
If you have code snippets, error messages, stack traces, please provide them here as well.
Please format your code correctly using code tags. See https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
Do not use screenshots, as they are difficult to read and (more importantly) do not allow others to copy and paste your code.
请提供能重现您遇到的问题的代码示例,最好是最小复现单元。
如果您有代码片段、错误信息、堆栈跟踪,也请在此提供。
请使用代码标签正确格式化您的代码。请参见 https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
请勿使用截图,因为截图难以阅读,而且(更重要的是)不允许他人复制粘贴您的代码。
placeholder: |
Steps to reproduce the behavior/复现Bug的步骤:
1.
2.
3.
- type: textarea
id: expected-behavior
validations:
required: true
attributes:
label: Expected behavior / 期待表现
description: "A clear and concise description of what you would expect to happen. /简单描述您期望发生的事情。"
\ No newline at end of file
name: "\U0001F680 Feature request"
description: Submit a request for a new ChatGLM3 feature / 提交一个新的 ChatGLM3 的功能建议
labels: [ "feature" ]
body:
- type: textarea
id: feature-request
validations:
required: true
attributes:
label: Feature request / 功能建议
description: |
A brief description of the functional proposal. Links to corresponding papers and code are desirable.
对功能建议的简述。最好提供对应的论文和代码链接
- type: textarea
id: motivation
validations:
required: true
attributes:
label: Motivation / 动机
description: |
Your motivation for making the suggestion. If that motivation is related to another GitHub issue, link to it here.
您提出建议的动机。如果该动机与另一个 GitHub 问题有关,请在此处提供对应的链接。
- type: textarea
id: contribution
validations:
required: true
attributes:
label: Your contribution / 您的贡献
description: |
Your PR link or any other link you can help with.
您的PR链接或者其他您能提供帮助的链接。
\ No newline at end of file
# Raise valuable PR / 提出有价值的PR
## Caution/ 注意事项:
Users should keep the following points in mind when submitting PRs:
1. The proposed PR should be about this project.
2. The proposed PR should be focused; if there are multiple ideas and optimizations, they should be split into different PRs.
用户在提交PR时候应该注意以下几点:
1. 提出的PR应该是关于本项目的。
2. 提出的PR应该具有针对性,如果具有多个不同的想法和优化方案,应该分配到不同的PR中。
## 不应该提出的PR / PRs that should not be proposed
If a developer proposes a PR that falls into any of the following categories, it may be closed or rejected:
1. PRs that do not describe the proposed improvement.
2. PRs that combine multiple issues of different types.
3. PRs that largely duplicate already existing PRs.
如果开发者提出关于以下方面的PR,则可能会被直接关闭或拒绝通过。
1. 没有说明改进方案的。
2. 多个不同类型的问题合并在一个PR中的。
3. 提出的PR与已经存在的PR高度重复的。
# 检查您的PR
- [ ] Have you read the Contributor Guidelines, Pull Request section? / 您是否阅读了贡献者指南、Pull Request 部分?
- [ ] Has this been discussed/approved via a Github issue or forum? If so, add a link. / 是否通过 Github 问题或论坛讨论/批准过?如果是,请添加链接。
- [ ] Did you make sure you updated the documentation with your changes? Here are the Documentation Guidelines, and here are the Documentation Formatting Tips. /您是否确保根据您的更改更新了文档?这里是文档指南,这里是文档格式化技巧。
- [ ] Did you write new required tests? / 您是否编写了新的必要测试?
- [ ] Is your PR limited to a single issue? / 您的PR是否仅针对一个问题?
\ No newline at end of file
__pycache__
# finetune_demo: generated & downloaded files
finetune_demo/output
finetune_demo/data
finetune_demo/formatted_data
ToolAlpaca/
AdvertiseGen/
*.gz
*.idea
.DS_Store
\ No newline at end of file
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:1.13.1-centos7.6-dtk-23.04-py38-latest
COPY requirements.txt requirements.txt
RUN source /opt/dtk-23.04/env.sh
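# Note: environment changes made by `source` in a RUN layer do not persist into later layers
# or into the running container; /opt/dtk-23.04/env.sh may need to be sourced again at runtime.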
RUN cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && echo 'Asia/Shanghai' >/etc/timezone
ENV LANG C.UTF-8
RUN pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
# Intel Device Demo
This folder helps developers accelerate and deploy the ChatGLM3-6B model on Intel devices.
## 1. Hardware requirements
Devices supported by the demos in this folder include:
- Intel CPUs, including consumer CPUs and server / workstation CPUs
- Intel Arc discrete GPUs, such as the Arc A770
- Intel integrated GPUs (CPU graphics)
- Other Intel toolkits that in principle support OpenVINO acceleration
## 2. Directory layout
- IPEX_llm_xxx_demo: IPEX-LLM is a low-precision, lightweight large-language-model library built for Intel XPUs (Xeon/Core/Flex/Arc/PVC). It offers broad model support, low latency, and a small memory footprint on Intel platforms; this demo shows accelerated model deployment with it.
- OpenVINO_demo: Accelerated model deployment using the Intel OpenVINO inference framework.
- Pytorch_demo (not yet released): Development in a PyTorch environment using Intel Extension for PyTorch (for Intel Arc GPUs).
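As a quick sanity check of the IPEX-LLM path, a minimal sketch along the lines of the generate example later in this folder (the model path is a placeholder; point it at your local chatglm3-6b checkpoint) might look like this:

```python
# Minimal IPEX-LLM INT4 inference sketch; model_path is a placeholder.
from ipex_llm.transformers import AutoModel
from transformers import AutoTokenizer

model_path = "THUDM/chatglm3-6b"
model = AutoModel.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# ChatGLM3 prompt format with role special tokens
prompt = "<|user|>\nWho are you?\n<|assistant|>"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```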
"""
This script implements an API for the ChatGLM3-6B model,
formatted similarly to OpenAI's API (https://platform.openai.com/docs/api-reference/chat).
It's designed to be run as a web server using FastAPI and uvicorn,
making the ChatGLM3-6B model accessible through the OpenAI Client.
Key Components and Features:
- Model and Tokenizer Setup: Configures the model and tokenizer paths and loads them.
- FastAPI Configuration: Sets up a FastAPI application with CORS middleware for handling cross-origin requests.
- API Endpoints:
- "/v1/models": Lists the available models, specifically ChatGLM3-6B.
- "/v1/chat/completions": Processes chat completion requests with options for streaming and regular responses.
- "/v1/embeddings": Processes embedding requests for a list of text inputs.
- Token Limit Caution: In the OpenAI API, 'max_tokens' is equivalent to HuggingFace's 'max_new_tokens', not 'max_length'.
For instance, setting 'max_tokens' to 8192 for a 6B model results in an error, because the model cannot output
that many tokens after the history and prompt tokens are deducted.
- Stream Handling and Custom Functions: Manages streaming responses and custom function calls within chat responses.
- Pydantic Models: Defines structured models for requests and responses, enhancing API documentation and type safety.
- Main Execution: Initializes the model and tokenizer, and starts the FastAPI app on the designated host and port.
Note:
This script doesn't include the setup for special tokens or multi-GPU support by default.
Users need to configure their special tokens and can enable multi-GPU support as per the provided instructions.
Embedding models are only supported on a single GPU.
"""
import os
import time
import tiktoken
import torch
import uvicorn
from fastapi import FastAPI, HTTPException, Response
from fastapi.middleware.cors import CORSMiddleware
from contextlib import asynccontextmanager
from typing import List, Literal, Optional, Union
from loguru import logger
from pydantic import BaseModel, Field
from ipex_llm.transformers import AutoModel
from transformers import AutoTokenizer
from utils import process_response, generate_chatglm3, generate_stream_chatglm3
# from sentence_transformers import SentenceTransformer
from sse_starlette.sse import EventSourceResponse
# Set up limit request time
EventSourceResponse.DEFAULT_PING_INTERVAL = 1000
# set LLM path
MODEL_PATH = os.environ.get('MODEL_PATH', 'THUDM/chatglm3-6b')
TOKENIZER_PATH = os.environ.get("TOKENIZER_PATH", MODEL_PATH)
# set Embedding Model path
EMBEDDING_PATH = os.environ.get('EMBEDDING_PATH', 'BAAI/bge-large-zh-v1.5')
@asynccontextmanager
async def lifespan(app: FastAPI):
yield
if torch.cuda.is_available():
torch.cuda.empty_cache()
...@@ -79,6 +108,33 @@ class DeltaMessage(BaseModel):
function_call: Optional[FunctionCallResponse] = None
## for Embedding
class EmbeddingRequest(BaseModel):
input: List[str]
model: str
class CompletionUsage(BaseModel):
prompt_tokens: int
completion_tokens: int
total_tokens: int
class EmbeddingResponse(BaseModel):
data: list
model: str
object: str
usage: CompletionUsage
# for ChatCompletionRequest
class UsageInfo(BaseModel):
prompt_tokens: int = 0
total_tokens: int = 0
completion_tokens: Optional[int] = 0
class ChatCompletionRequest(BaseModel):
model: str
messages: List[ChatMessage]
...@@ -86,8 +142,7 @@ class ChatCompletionRequest(BaseModel):
top_p: Optional[float] = 0.8
max_tokens: Optional[int] = None
stream: Optional[bool] = False
tools: Optional[Union[dict, List[dict]]] = None
repetition_penalty: Optional[float] = 1.1
...@@ -98,29 +153,68 @@ class ChatCompletionResponseChoice(BaseModel):
class ChatCompletionResponseStreamChoice(BaseModel):
index: int
delta: DeltaMessage
finish_reason: Optional[Literal["stop", "length", "function_call"]]
class ChatCompletionResponse(BaseModel):
model: str
id: str
object: Literal["chat.completion", "chat.completion.chunk"]
choices: List[Union[ChatCompletionResponseChoice, ChatCompletionResponseStreamChoice]]
created: Optional[int] = Field(default_factory=lambda: int(time.time()))
usage: Optional[UsageInfo] = None
@app.get("/health")
async def health() -> Response:
"""Health check."""
return Response(status_code=200)
@app.post("/v1/embeddings", response_model=EmbeddingResponse)
async def get_embeddings(request: EmbeddingRequest):
embeddings = [embedding_model.encode(text) for text in request.input]
embeddings = [embedding.tolist() for embedding in embeddings]
def num_tokens_from_string(string: str) -> int:
"""
Returns the number of tokens in a text string.
use cl100k_base tokenizer
"""
encoding = tiktoken.get_encoding('cl100k_base')
num_tokens = len(encoding.encode(string))
return num_tokens
response = {
"data": [
{
"object": "embedding",
"embedding": embedding,
"index": index
}
for index, embedding in enumerate(embeddings)
],
"model": request.model,
"object": "list",
"usage": CompletionUsage(
prompt_tokens=sum(len(text.split()) for text in request.input),
completion_tokens=0,
total_tokens=sum(num_tokens_from_string(text) for text in request.input),
)
}
return response
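# Illustrative client-side call for the embeddings endpoint above (not part of this file),
# assuming the server runs locally on port 8000 and the `openai` client package is installed:
#
#   from openai import OpenAI
#   client = OpenAI(api_key="EMPTY", base_url="http://127.0.0.1:8000/v1/")
#   result = client.embeddings.create(model="chatglm3-6b", input=["hello", "你好"])
#   print(len(result.data), len(result.data[0].embedding))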
@app.get("/v1/models", response_model=ModelList) @app.get("/v1/models", response_model=ModelList)
async def list_models(): async def list_models():
model_card = ModelCard(id="chatglm3-6b") model_card = ModelCard(
return ModelList(data=[model_card]) id="chatglm3-6b"
)
return ModelList(
data=[model_card]
)
@app.post("/v1/chat/completions", response_model=ChatCompletionResponse) @app.post("/v1/chat/completions", response_model=ChatCompletionResponse)
...@@ -138,24 +232,74 @@ async def create_chat_completion(request: ChatCompletionRequest): ...@@ -138,24 +232,74 @@ async def create_chat_completion(request: ChatCompletionRequest):
echo=False, echo=False,
stream=request.stream, stream=request.stream,
repetition_penalty=request.repetition_penalty, repetition_penalty=request.repetition_penalty,
functions=request.functions, tools=request.tools,
) )
logger.debug(f"==== request ====\n{gen_params}") logger.debug(f"==== request ====\n{gen_params}")
if request.stream: if request.stream:
generate = predict(request.model, gen_params)
return EventSourceResponse(generate, media_type="text/event-stream")
# Use the stream mode to read the first few characters, if it is not a function call, direct stram output
predict_stream_generator = predict_stream(request.model, gen_params)
output = next(predict_stream_generator)
if not contains_custom_function(output):
return EventSourceResponse(predict_stream_generator, media_type="text/event-stream")
# Obtain the full result at once and determine whether tools need to be called.
logger.debug(f"First result output:\n{output}")
function_call = None
if output and request.tools:
try:
function_call = process_response(output, use_tool=True)
except:
logger.warning("Failed to parse tool call")
# CallFunction
if isinstance(function_call, dict):
function_call = FunctionCallResponse(**function_call)
"""
In this demo, we did not register any tools.
You can use the tools that have been implemented in our `tools_using_demo` and implement your own streaming tool implementation here.
Similar to the following method:
function_args = json.loads(function_call.arguments)
tool_response = dispatch_tool(tool_name: str, tool_params: dict)
"""
tool_response = ""
if not gen_params.get("messages"):
gen_params["messages"] = []
gen_params["messages"].append(ChatMessage(
role="assistant",
content=output,
))
gen_params["messages"].append(ChatMessage(
role="function",
name=function_call.name,
content=tool_response,
))
# Streaming output of results after function calls
generate = predict(request.model, gen_params)
return EventSourceResponse(generate, media_type="text/event-stream")
else:
# Handled to avoid exceptions in the above parsing function process.
generate = parse_output_text(request.model, output)
return EventSourceResponse(generate, media_type="text/event-stream")
# Here is the handling of stream = False
response = generate_chatglm3(model, tokenizer, gen_params)
# Remove the first newline character
if response["text"].startswith("\n"):
response["text"] = response["text"][1:]
response["text"] = response["text"].strip()
usage = UsageInfo()
function_call, finish_reason = None, "stop"
if request.tools:
try:
function_call = process_response(response["text"], use_tool=True)
except:
...@@ -181,7 +325,14 @@ async def create_chat_completion(request: ChatCompletionRequest):
task_usage = UsageInfo.model_validate(response["usage"])
for usage_key, usage_value in task_usage.model_dump().items():
setattr(usage, usage_key, getattr(usage, usage_key) + usage_value)
return ChatCompletionResponse(
model=request.model,
id="", # for open_source model, id is empty
choices=[choice_data],
object="chat.completion",
usage=usage
)
async def predict(model_id: str, params: dict):
...@@ -192,7 +343,7 @@ async def predict(model_id: str, params: dict):
delta=DeltaMessage(role="assistant"),
finish_reason=None
)
chunk = ChatCompletionResponse(model=model_id, id="", choices=[choice_data], object="chat.completion.chunk")
yield "{}".format(chunk.model_dump_json(exclude_unset=True))
previous_text = ""
...@@ -210,7 +361,8 @@ async def predict(model_id: str, params: dict):
try:
function_call = process_response(decoded_unicode, use_tool=True)
except:
logger.warning(
"Failed to parse tool call, maybe the response is not a tool call or have been answered.")
if isinstance(function_call, dict):
function_call = FunctionCallResponse(**function_call)
...@@ -226,7 +378,12 @@ async def predict(model_id: str, params: dict):
delta=delta,
finish_reason=finish_reason
)
chunk = ChatCompletionResponse(
model=model_id,
id="",
choices=[choice_data],
object="chat.completion.chunk"
)
yield "{}".format(chunk.model_dump_json(exclude_unset=True)) yield "{}".format(chunk.model_dump_json(exclude_unset=True))
choice_data = ChatCompletionResponseStreamChoice( choice_data = ChatCompletionResponseStreamChoice(
...@@ -234,16 +391,141 @@ async def predict(model_id: str, params: dict): ...@@ -234,16 +391,141 @@ async def predict(model_id: str, params: dict):
delta=DeltaMessage(), delta=DeltaMessage(),
finish_reason="stop" finish_reason="stop"
) )
chunk = ChatCompletionResponse(model=model_id, choices=[choice_data], object="chat.completion.chunk") chunk = ChatCompletionResponse(
model=model_id,
id="",
choices=[choice_data],
object="chat.completion.chunk"
)
yield "{}".format(chunk.model_dump_json(exclude_unset=True)) yield "{}".format(chunk.model_dump_json(exclude_unset=True))
yield '[DONE]' yield '[DONE]'
if __name__ == "__main__": def predict_stream(model_id, gen_params):
"""
The function call is compatible with stream-mode output.
The first seven characters of the output are inspected to decide whether it is a function call.
If it is not a function call, the output is streamed directly.
Otherwise, the complete text of the function call is returned.
:param model_id:
:param gen_params:
:return:
"""
output = ""
is_function_call = False
has_send_first_chunk = False
for new_response in generate_stream_chatglm3(model, tokenizer, gen_params):
decoded_unicode = new_response["text"]
delta_text = decoded_unicode[len(output):]
output = decoded_unicode
# When it is not yet identified as a function call and the output length is > 7,
# try to judge whether it is a function call according to the special function prefix
if not is_function_call and len(output) > 7:
# Determine whether a function is called
is_function_call = contains_custom_function(output)
if is_function_call:
continue
# Non-function call, direct stream output
finish_reason = new_response["finish_reason"]
# Send an empty string first to avoid truncation by subsequent next() operations.
if not has_send_first_chunk:
message = DeltaMessage(
content="",
role="assistant",
function_call=None,
)
choice_data = ChatCompletionResponseStreamChoice(
index=0,
delta=message,
finish_reason=finish_reason
)
chunk = ChatCompletionResponse(
model=model_id,
id="",
choices=[choice_data],
created=int(time.time()),
object="chat.completion.chunk"
)
yield "{}".format(chunk.model_dump_json(exclude_unset=True))
send_msg = delta_text if has_send_first_chunk else output
has_send_first_chunk = True
message = DeltaMessage(
content=send_msg,
role="assistant",
function_call=None,
)
choice_data = ChatCompletionResponseStreamChoice(
index=0,
delta=message,
finish_reason=finish_reason
)
chunk = ChatCompletionResponse(
model=model_id,
id="",
choices=[choice_data],
created=int(time.time()),
object="chat.completion.chunk"
)
yield "{}".format(chunk.model_dump_json(exclude_unset=True))
if is_function_call:
yield output
else:
yield '[DONE]'
async def parse_output_text(model_id: str, value: str):
"""
Directly output the text content of value
:param model_id:
:param value:
:return:
"""
choice_data = ChatCompletionResponseStreamChoice(
index=0,
delta=DeltaMessage(role="assistant", content=value),
finish_reason=None
)
chunk = ChatCompletionResponse(model=model_id, id="", choices=[choice_data], object="chat.completion.chunk")
yield "{}".format(chunk.model_dump_json(exclude_unset=True))
choice_data = ChatCompletionResponseStreamChoice(
index=0,
delta=DeltaMessage(),
finish_reason="stop"
)
chunk = ChatCompletionResponse(model=model_id, id="", choices=[choice_data], object="chat.completion.chunk")
yield "{}".format(chunk.model_dump_json(exclude_unset=True))
yield '[DONE]'
def contains_custom_function(value: str) -> bool:
"""
Determine whether the output is a 'function_call' according to a special function prefix.
For example, the functions defined in "tools_using_demo/tool_register.py" are all "get_xxx" and start with "get_"
[Note] This is not a rigorous judgment method, only for reference.
:param value:
:return:
"""
return value and 'get_' in value
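# Illustrative (not part of the original file): contains_custom_function("get_current_weather\n{...}")
# returns True, while contains_custom_function("Hello!") returns False.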
if __name__ == "__main__":
# Load LLM
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_PATH,
load_in_4bit=True,
trust_remote_code=True)
# load Embedding
# embedding_model = SentenceTransformer(EMBEDDING_PATH, device="cuda")
uvicorn.run(app, host='0.0.0.0', port=8000, workers=1)
import time
from ipex_llm.transformers import AutoModel
from transformers import AutoTokenizer
CHATGLM_V3_PROMPT_FORMAT = "<|user|>\n{prompt}\n<|assistant|>"
# Please specify the local path to the chatglm3-6b model
model_path = "D:/AI/ChatGLM3/model/chatglm3-6b/"
# Load the ChatGLM3-6B model and quantize it to INT4
model = AutoModel.from_pretrained(model_path,
load_in_4bit=True,
trust_remote_code=True)
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path,
trust_remote_code=True)
# Prepare ChatGLM3 format prompt
prompt = CHATGLM_V3_PROMPT_FORMAT.format(prompt="Who are you?")
# Encode the prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt")
st = time.time()
# Perform inference calculation and generate Tokens
output = model.generate(input_ids, max_new_tokens=32)
end = time.time()
# Decode the generated Tokens and display them
output_str = tokenizer.decode(output[0], skip_special_tokens=True)
print(f'Inference time: {end-st} s')
print('-'*20, 'Prompt', '-'*20)
print(prompt)
print('-'*20, 'Output', '-'*20)
print(output_str)
"""
This script creates an interactive web demo for the ChatGLM3-6B model using Gradio,
a Python library for building quick and easy UI components for machine learning models.
It's designed to showcase the capabilities of the ChatGLM3-6B model in a user-friendly interface,
allowing users to interact with the model through a chat-like interface.
Usage:
- Run the script to start the Gradio web server.
- Interact with the model by typing questions and receiving responses.
Requirements:
- Gradio (required for 4.13.0 and later, 3.x is not support now) should be installed.
Note: The script includes a modification to the Chatbot's postprocess method to handle markdown to HTML conversion,
ensuring that the chat interface displays formatted text correctly.
"""
import os
import streamlit as st
from ipex_llm.transformers import AutoModel
from transformers import AutoTokenizer
st.set_page_config(
page_title="ChatGLM3-6B+BigDL-LLM demo",
page_icon=":robot:",
layout="wide"
)
MODEL_PATH = os.environ.get('MODEL_PATH', 'THUDM/chatglm3-6b')
@st.cache_resource
def get_model():
model = AutoModel.from_pretrained(MODEL_PATH,
load_in_4bit=True,
trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH,
trust_remote_code=True)
return tokenizer, model
tokenizer, model = get_model()
if "history" not in st.session_state:
st.session_state.history = []
if "past_key_values" not in st.session_state:
st.session_state.past_key_values = None
max_length = st.sidebar.slider("max_length", 0, 32768, 8192, step=1)
top_p = st.sidebar.slider("top_p", 0.0, 1.0, 0.8, step=0.01)
temperature = st.sidebar.slider("temperature", 0.0, 1.0, 0.6, step=0.01)
buttonClean = st.sidebar.button("Clear session history", key="clean")
if buttonClean:
st.session_state.history = []
st.session_state.past_key_values = None
st.rerun()
for i, message in enumerate(st.session_state.history):
if message["role"] == "user":
with st.chat_message(name="user", avatar="user"):
st.markdown(message["content"])
else:
with st.chat_message(name="assistant", avatar="assistant"):
st.markdown(message["content"])
with st.chat_message(name="user", avatar="user"):
input_placeholder = st.empty()
with st.chat_message(name="assistant", avatar="assistant"):
message_placeholder = st.empty()
prompt_text = st.chat_input("Please enter your question.")
if prompt_text:
input_placeholder.markdown(prompt_text)
history = st.session_state.history
past_key_values = st.session_state.past_key_values
for response, history, past_key_values in model.stream_chat(
tokenizer,
prompt_text,
history,
past_key_values=past_key_values,
max_length=max_length,
top_p=top_p,
temperature=temperature,
return_past_key_values=True,
):
message_placeholder.markdown(response)
st.session_state.history = history
st.session_state.past_key_values = past_key_values
\ No newline at end of file
import torch
import time
import argparse
import numpy as np
from ipex_llm.transformers import AutoModel
from modelscope import AutoTokenizer
from transformers import AutoTokenizer
# you could tune the prompt based on your own model,
# here the prompt tuning refers to https://github.com/THUDM/ChatGLM3/blob/main/PROMPT.md
CHATGLM_V3_PROMPT_FORMAT = "<|user|>\n{prompt}\n<|assistant|>"
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for ModelScope ChatGLM3 model')
parser.add_argument('--repo-id-or-model-path', type=str, default="ZhipuAI/chatglm3-6b",
help='The ModelScope repo id for the ChatGLM3 model to be downloaded'
', or the path to the ModelScope checkpoint folder')
parser.add_argument('--prompt', type=str, default="AI是什么?",
help='Prompt to infer')
parser.add_argument('--n-predict', type=int, default=32,
help='Max tokens to predict')
args = parser.parse_args()
model_path = args.repo_id_or_model_path
# Load model in 4 bit,
# which convert the relevant layers in the model into INT4 format
# It is important to set `model_hub='modelscope'`, otherwise model hub is default to be huggingface
model = AutoModel.from_pretrained(model_path,
load_in_4bit=True,
trust_remote_code=True,
model_hub='modelscope')
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path,
trust_remote_code=True)
# Generate predicted tokens
with torch.inference_mode():
prompt = CHATGLM_V3_PROMPT_FORMAT.format(prompt=args.prompt)
input_ids = tokenizer.encode(prompt, return_tensors="pt")
st = time.time()
# if your selected model is capable of utilizing previous key/value attentions
# to enhance decoding speed, but has `"use_cache": false` in its model config,
# it is important to set `use_cache=True` explicitly in the `generate` function
# to obtain optimal performance with IPEX-LLM INT4 optimizations
output = model.generate(input_ids,
max_new_tokens=args.n_predict)
end = time.time()
output_str = tokenizer.decode(output[0], skip_special_tokens=True)
print(f'Inference time: {end - st} s')
print('-' * 20, 'Prompt', '-' * 20)
print(prompt)
print('-' * 20, 'Output', '-' * 20)
print(output_str)
\ No newline at end of file
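# Example invocation of the ModelScope script above (assuming it is saved as generate.py):
#   python generate.py --repo-id-or-model-path ZhipuAI/chatglm3-6b --prompt "AI是什么?" --n-predict 64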
"""
This script is an example of using the OpenAI API to create various interactions with a ChatGLM3 model.
It includes functions to:
1. Conduct a basic chat session, asking about weather conditions in multiple cities.
2. Initiate a simple chat in Chinese, asking the model to tell a short story.
3. Retrieve and print embeddings for a given text input.
Each function demonstrates a different aspect of the API's capabilities, showcasing how to make requests
and handle responses.
"""
from openai import OpenAI
import time
base_url = "http://127.0.0.1:8000/v1/"
client = OpenAI(api_key="EMPTY", base_url=base_url)
def function_chat():
messages = [{"role": "user", "content": "What's the weather like in San Francisco, Tokyo, and Paris?"}]
tools = [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather in a given location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA",
},
"unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
},
"required": ["location"],
},
},
}
]
response = client.chat.completions.create(
model="chatglm3-6b",
messages=messages,
tools=tools,
tool_choice="auto",
)
if response:
content = response.choices[0].message.content
print(content)
else:
print("Error:", response.status_code)
def simple_chat(use_stream=True):
messages = [
{
"role": "system",
"content": "You are ChatGLM3, a large language model trained by Zhipu.AI. Follow the user's "
"instructions carefully. Respond using markdown.",
},
{
"role": "user",
"content": "你好,请你用生动的话语给我讲一个小故事吧"
}
]
response = client.chat.completions.create(
model="chatglm3-6b",
messages=messages,
stream=use_stream,
max_tokens=256,
temperature=0.8,
presence_penalty=1.1,
top_p=0.8)
if response:
if use_stream:
for chunk in response:
print(chunk.choices[0].delta.content)
else:
content = response.choices[0].message.content
print(content)
else:
print("Error:", response.status_code)
if __name__ == "__main__":
simple_chat(use_stream=False)
simple_chat(use_stream=True)
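# function_chat() above builds a tool-call request but is not invoked by default;
# if your server is started with tool support, you can additionally call:
#   function_chat()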
import gc
import json
import torch
from transformers import PreTrainedModel, PreTrainedTokenizer
from transformers.generation.logits_process import LogitsProcessor
from typing import Union, Tuple
class InvalidScoreLogitsProcessor(LogitsProcessor):
def __call__(
self, input_ids: torch.LongTensor, scores: torch.FloatTensor
) -> torch.FloatTensor:
if torch.isnan(scores).any() or torch.isinf(scores).any():
scores.zero_()
scores[..., 5] = 5e4
return scores
def process_response(output: str, use_tool: bool = False) -> Union[str, dict]:
content = ""
for response in output.split("<|assistant|>"):
metadata, content = response.split("\n", maxsplit=1)
if not metadata.strip():
content = content.strip()
content = content.replace("[[训练时间]]", "2023年")
else:
if use_tool:
content = "\n".join(content.split("\n")[1:-1])
def tool_call(**kwargs):
return kwargs
parameters = eval(content)
content = {
"name": metadata.strip(),
"arguments": json.dumps(parameters, ensure_ascii=False)
}
else:
content = {
"name": metadata.strip(),
"content": content
}
return content
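# Illustrative example (not part of the original file): for a model output segment such as
#   "get_current_weather\n```python\ntool_call(location='Beijing')\n```"
# process_response(..., use_tool=True) strips the first and last lines of the content,
# evaluates tool_call(...), and returns
#   {"name": "get_current_weather", "arguments": "{\"location\": \"Beijing\"}"}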
@torch.inference_mode()
def generate_stream_chatglm3(model: PreTrainedModel, tokenizer: PreTrainedTokenizer, params: dict):
messages = params["messages"]
tools = params["tools"]
temperature = float(params.get("temperature", 1.0))
repetition_penalty = float(params.get("repetition_penalty", 1.0))
top_p = float(params.get("top_p", 1.0))
max_new_tokens = int(params.get("max_tokens", 256))
echo = params.get("echo", True)
messages = process_chatglm_messages(messages, tools=tools)
query, role = messages[-1]["content"], messages[-1]["role"]
inputs = tokenizer.build_chat_input(query, history=messages[:-1], role=role)
inputs = inputs.to(model.device)
input_echo_len = len(inputs["input_ids"][0])
if input_echo_len >= model.config.seq_length:
print(f"Input length larger than {model.config.seq_length}")
eos_token_id = [
tokenizer.eos_token_id,
tokenizer.get_command("<|user|>"),
]
gen_kwargs = {
"max_new_tokens": max_new_tokens,
"do_sample": True if temperature > 1e-5 else False,
"top_p": top_p,
"repetition_penalty": repetition_penalty,
"logits_processor": [InvalidScoreLogitsProcessor()],
}
if temperature > 1e-5:
gen_kwargs["temperature"] = temperature
total_len = 0
for total_ids in model.stream_generate(**inputs, eos_token_id=eos_token_id, **gen_kwargs):
total_ids = total_ids.tolist()[0]
total_len = len(total_ids)
if echo:
output_ids = total_ids[:-1]
else:
output_ids = total_ids[input_echo_len:-1]
response = tokenizer.decode(output_ids)
if response and response[-1] != "�":
response, stop_found = apply_stopping_strings(response, ["<|observation|>"])
yield {
"text": response,
"usage": {
"prompt_tokens": input_echo_len,
"completion_tokens": total_len - input_echo_len,
"total_tokens": total_len,
},
"finish_reason": "function_call" if stop_found else None,
}
if stop_found:
break
# Only the last stream result contains finish_reason; set finish_reason to stop
ret = {
"text": response,
"usage": {
"prompt_tokens": input_echo_len,
"completion_tokens": total_len - input_echo_len,
"total_tokens": total_len,
},
"finish_reason": "stop",
}
yield ret
gc.collect()
torch.cuda.empty_cache()
def process_chatglm_messages(messages, tools=None):
_messages = messages
messages = []
if tools:
messages.append(
{
"role": "system",
"content": "Answer the following questions as best as you can. You have access to the following tools:",
"tools": tools
}
)
for m in _messages:
role, content, func_call = m.role, m.content, m.function_call
if role == "function":
messages.append(
{
"role": "observation",
"content": content
}
)
elif role == "assistant" and func_call is not None:
for response in content.split("<|assistant|>"):
metadata, sub_content = response.split("\n", maxsplit=1)
messages.append(
{
"role": role,
"metadata": metadata,
"content": sub_content.strip()
}
)
else:
messages.append({"role": role, "content": content})
return messages
def generate_chatglm3(model: PreTrainedModel, tokenizer: PreTrainedTokenizer, params: dict):
for response in generate_stream_chatglm3(model, tokenizer, params):
pass
return response
def apply_stopping_strings(reply, stop_strings) -> Tuple[str, bool]:
stop_found = False
for string in stop_strings:
idx = reply.find(string)
if idx != -1:
reply = reply[:idx]
stop_found = True
break
if not stop_found:
# If something like "\nYo" is generated just before "\nYou:" is completed, trim it
for string in stop_strings:
for j in range(len(string) - 1, 0, -1):
if reply[-j:] == string[:j]:
reply = reply[:-j]
break
else:
continue
break
return reply, stop_found
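# Illustrative behavior (not part of the original file):
#   apply_stopping_strings("Hi<|observation|>tail", ["<|observation|>"]) -> ("Hi", True)
#   apply_stopping_strings("Hi<|obs", ["<|observation|>"]) -> ("Hi", False)   # partial suffix trimmed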
# Deploying the ChatGLM3-6B Model with OpenVINO
[OpenVINO](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html) is an open-source toolkit designed by Intel for deep learning inference. It helps developers optimize models, improve inference performance, and reduce memory footprint. This example shows how to deploy ChatGLM3 with OpenVINO.
You need to clone this repository and then follow the steps below to convert the model into an OpenVINO IR model and run inference.
## 1. Environment setup
First, clone the OpenVINO ChatGLM3 inference repository and install its dependencies.
```bash
git clone https://github.com/OpenVINO-dev-contest/chatglm3.openvino.git
cd chatglm3.openvino
```
Next, we recommend creating a new virtual environment and installing the dependencies as follows.
```
python3 -m venv openvino_env
source openvino_env/bin/activate
python3 -m pip install --upgrade pip
pip install wheel setuptools
pip install -r requirements.txt
```
## 2. Convert the model
Since the Hugging Face model needs to be converted into an OpenVINO IR model, you need to download the model and convert it.
```
python3 convert.py --model_id THUDM/chatglm3-6b --output {your_path}/chatglm3-6b
```
### Optional parameters
* `--model_id` - path to the directory containing the model (absolute path).
* `--output` - path where the converted model is saved.
## 3. Quantize the model (optional)
```
python3 quantize.py --model_path {your_path}/chatglm3-6b --precision int4 --output {your_path}/chatglm3-6b-int4
```
### Optional parameters
* `--model_path` - path to the directory containing the OpenVINO IR model.
* `--precision` - quantization precision: int8 or int4.
* `--output` - path where the quantized model is saved.
## 4. Run the ChatGLM3 model
```
python3 chat.py --model_path {your_path}/chatglm3-6b --max_sequence_length 4096 --device CPU
```
### Optional parameters
* `--model_path` - path to the directory containing the OpenVINO IR model.
* `--max_sequence_length` - maximum number of output tokens.
* `--device` - the device to run inference on.
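If you prefer to load the converted IR model from your own script rather than chat.py, a minimal sketch based on the chat.py shipped with this demo (the model path is a placeholder) might look like this:

```python
# Minimal loading/generation sketch for a converted OpenVINO IR model; the path is a placeholder.
from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoConfig, AutoTokenizer

model_dir = "{your_path}/chatglm3-6b"  # directory produced by convert.py / quantize.py
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
ov_model = OVModelForCausalLM.from_pretrained(
    model_dir,
    device="CPU",
    ov_config={"PERFORMANCE_HINT": "LATENCY", "NUM_STREAMS": "1", "CACHE_DIR": ""},
    config=AutoConfig.from_pretrained(model_dir, trust_remote_code=True),
    trust_remote_code=True,
)

# Build ChatGLM3 chat inputs and generate a short reply
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "你好"}],
    add_generation_prompt=True, tokenize=True, return_tensors="pt",
)
output = ov_model.generate(inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```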
## Example
```
用户: 你好
ChatGLM3-6B-OpenVINO: 你好!有什么我可以帮助你的吗?
用户: 你是谁?
ChatGLM3-6B-OpenVINO: 我是一个名为ChatGLM3-6B的人工智能助手,是由清华大学KEG实验室和智谱AI 公司于2023 年共同训练的语言模型开发而成。我的任务是针对用户的问题和要求提供适当的答复和支持。
用户: 请给我讲一个有趣的故事
ChatGLM3-6B-OpenVINO: 从前,有一个名叫小明的小男孩,他是一个非常喜欢动物的人。有一天,他在森林里散步时,发现了一个非常漂亮的小鸟。小鸟受伤了,无法飞行。小明非常心疼,于是决定照顾这只小鸟。小明带着小鸟回家,为它搭建了一个小小的巢穴,并找来了一些软草和食物。每天,他都会给小鸟喂食,并为它换水。渐渐地,小鸟的伤势好了起来,开始在小明的家里飞来飞去,它们成了非常好的朋友。然而,一天,小明的父母告诉他,他们必须把小明养的小鸟送到森林里去。小明非常伤心,因为他已经和小鸟成为了好朋友。但是,他的父母告诉他,小鸟在森林里会更加自由自在,而且他也可以继续观看小鸟在森林中的生活。于是,小明和他的父母一起将小鸟送到了森林中。小鸟非常高兴,因为它又可以飞行了,并且还有许多其他的小动物朋友。小明也感到非常开心,因为他知道,即使不能一直拥有小鸟,他仍然可以欣赏到它们在自然中的美丽。从此以后,小明常常来到森林中,寻找小鸟。
用户: 请给这个故事起一个标题
ChatGLM3-6B-OpenVINO: 《友谊的力量:小明与小鸟的森林冒险》
```
## FAQ
1. Why does loading a local model still raise a Hugging Face connection error?
- Downgrade the transformers library to version 4.37.2
2. Do I need to install the OpenVINO C++ inference engine?
- No
3. Do I have to use Intel hardware?
- We have only tried this on Intel devices, and we recommend x86 Intel devices, including but not limited to:
- Intel CPUs, including consumer and server CPUs.
- Intel discrete GPUs, e.g. the Arc A770.
\ No newline at end of file
import argparse
from typing import List, Tuple
from threading import Thread
import torch
from optimum.intel.openvino import OVModelForCausalLM
from transformers import (AutoTokenizer, AutoConfig,
TextIteratorStreamer, StoppingCriteriaList, StoppingCriteria)
def parse_text(text):
lines = text.split("\n")
lines = [line for line in lines if line != ""]
count = 0
for i, line in enumerate(lines):
if "```" in line:
count += 1
items = line.split('`')
if count % 2 == 1:
lines[i] = f'<pre><code class="language-{items[-1]}">'
else:
lines[i] = f'<br></code></pre>'
else:
if i > 0:
if count % 2 == 1:
line = line.replace("`", "\`")
line = line.replace("<", "&lt;")
line = line.replace(">", "&gt;")
line = line.replace(" ", "&nbsp;")
line = line.replace("*", "&ast;")
line = line.replace("_", "&lowbar;")
line = line.replace("-", "&#45;")
line = line.replace(".", "&#46;")
line = line.replace("!", "&#33;")
line = line.replace("(", "&#40;")
line = line.replace(")", "&#41;")
line = line.replace("$", "&#36;")
lines[i] = "<br>" + line
text = "".join(lines)
return text
class StopOnTokens(StoppingCriteria):
def __init__(self, token_ids):
self.token_ids = token_ids
def __call__(
self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs
) -> bool:
for stop_id in self.token_ids:
if input_ids[0][-1] == stop_id:
return True
return False
if __name__ == "__main__":
parser = argparse.ArgumentParser(add_help=False)
parser.add_argument('-h',
'--help',
action='help',
help='Show this help message and exit.')
parser.add_argument('-m',
'--model_path',
required=True,
type=str,
help='Required. model path')
parser.add_argument('-l',
'--max_sequence_length',
default=256,
required=False,
type=int,
help='Optional. Maximum length of the output.')
parser.add_argument('-d',
'--device',
default='CPU',
required=False,
type=str,
help='Optional. Device used for inference.')
args = parser.parse_args()
model_dir = args.model_path
ov_config = {"PERFORMANCE_HINT": "LATENCY",
"NUM_STREAMS": "1", "CACHE_DIR": ""}
tokenizer = AutoTokenizer.from_pretrained(
model_dir, trust_remote_code=True)
print("====Compiling model====")
ov_model = OVModelForCausalLM.from_pretrained(
model_dir,
device=args.device,
ov_config=ov_config,
config=AutoConfig.from_pretrained(model_dir, trust_remote_code=True),
trust_remote_code=True,
)
streamer = TextIteratorStreamer(
tokenizer, timeout=60.0, skip_prompt=True, skip_special_tokens=True
)
stop_tokens = [0, 2]
stop_tokens = [StopOnTokens(stop_tokens)]
def convert_history_to_token(history: List[Tuple[str, str]]):
messages = []
for idx, (user_msg, model_msg) in enumerate(history):
if idx == len(history) - 1 and not model_msg:
messages.append({"role": "user", "content": user_msg})
break
if user_msg:
messages.append({"role": "user", "content": user_msg})
if model_msg:
messages.append({"role": "assistant", "content": model_msg})
model_inputs = tokenizer.apply_chat_template(messages,
add_generation_prompt=True,
tokenize=True,
return_tensors="pt")
return model_inputs
history = []
print("====Starting conversation====")
while True:
input_text = input("用户: ")
if input_text.lower() == 'stop':
break
if input_text.lower() == 'clear':
history = []
print("AI助手: 对话历史已清空")
continue
print("ChatGLM3-6B-OpenVINO:", end=" ")
history = history + [[parse_text(input_text), ""]]
model_inputs = convert_history_to_token(history)
generate_kwargs = dict(
input_ids=model_inputs,
max_new_tokens=args.max_sequence_length,
temperature=0.1,
do_sample=True,
top_p=1.0,
top_k=50,
repetition_penalty=1.1,
streamer=streamer,
stopping_criteria=StoppingCriteriaList(stop_tokens)
)
t1 = Thread(target=ov_model.generate, kwargs=generate_kwargs)
t1.start()
partial_text = ""
for new_text in streamer:
new_text = new_text
print(new_text, end="", flush=True)
partial_text += new_text
print("\n")
history[-1][1] = partial_text
\ No newline at end of file
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright 2024 ChatGLM team @ Zhipu AI
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
...@@ -132,6 +132,7 @@ data[:5]
<|assistant|>
该文件看起来包含有关某些条目的元数据,每个条目有以下字段:
- `file_name`: 文件名称
- `name`: 名称
- `type`: 类型(例如 "survivor" 或 "killer")
......
...@@ -29,7 +29,7 @@ Where `<|role|>` part is represented in a special token, which can’t be encod
### Example Scenarios
For better readability, an extra `\n` is added before each role special token. This extra `\n` should not be added in actual use and tokenizer implementation.
#### Multi-turn Dialogue
* There are only three roles: `<|user|>`, `<|assistant|>`, and `<|system|>`.
......
...@@ -33,44 +33,43 @@ ChatGLM3-6B同样采用Transformer模型结构:
### Docker (Method 1)
Running inside Docker is recommended; a pullable Docker image is provided:
```bash
docker pull image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-centos7.6-dtk23.10.1-py310
```
Enter the container and install the dependencies that the image does not include:
```bash
docker run -dit --network=host --name=chatglm3 --privileged --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size=16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -u root --ulimit stack=-1:-1 --ulimit memlock=-1:-1 image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-centos7.6-dtk23.10.1-py310
docker exec -it chatglm3 /bin/bash
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
cd finetune_demo
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
```
### Conda (Method 2)
1. Create a conda virtual environment:
```bash
conda create -n chatglm python=3.10
```
2. 关于本项目DCU显卡所需的工具包、深度学习库等均可从[光合](https://developer.hpccube.com/tool/)开发者社区下载安装。
- [DTK 23.10.1](https://cancon.hpccube.com:65024/1/main/DTK-23.10.1)
- [Pytorch 2.1](https://cancon.hpccube.com:65024/4/main/pytorch/previous_release/dtk23.10)
- [Deepspeed 0.12.3](https://cancon.hpccube.com:65024/4/main/deepspeed/previous_release/dtk23.10)
Tips:以上dtk驱动、python、deepspeed等工具版本需要严格一一对应。
3. 其它依赖库参照requirements.txt安装:
```bash
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
cd finetune_demo
pip install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
```
### 注意
```
#到虚拟环境下对应的python/site-packages注释掉一些版本判断
site-packages/accelerate/accelerator.py 文件
```
...@@ -89,7 +88,7 @@ site-packages/transformers/utils/versions.py 文件
## 数据集
单轮对话数据以[ADGEN](https://aclanthology.org/D19-1321.pdf) (广告生成) 数据集为例介绍代码的使用方法,该数据集任务为根据输入(content)生成一段广告词(summary),以下为下载地址:
- [Google Drive](https://drive.google.com/file/d/13_vf0xRTQsyneRKdD1bZIr93vBGOczrk/view?usp=sharing) 或者 [Tsinghua Cloud](https://cloud.tsinghua.edu.cn/f/b3f119a008264b1cabd1/?dl=1)
下载处理好的 ADGEN 数据集,将解压后的AdvertiseGen目录放到 [finetune_demo/data](./finetune_demo/data)目录下。数据集目录结构如下:
```
── AdvertiseGen
│   ├── dev.json
...@@ -97,18 +96,10 @@ site-packages/transformers/utils/versions.py 文件
```
通过以下方式将数据集处理成模型需要的格式:
```bash
cd finetune_demo
python process.py
```
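下面给出一个示意性的转换脚本,演示如何把 ADGEN 的 content/summary 样本整理成单轮对话形式。其中输出字段 conversations/role/content 仅为假设的示意格式,路径也仅为示例,实际格式与路径请以 finetune_demo 中 process.py 的实现为准:

```python
import json

# 示意脚本:将 ADGEN 的 content/summary 样本转换为单轮对话格式
# (输出字段名与路径仅为示意,实际请以 finetune_demo/process.py 为准)
with open("data/AdvertiseGen/train.json", encoding="utf-8") as fin, \
        open("data/train.json", "w", encoding="utf-8") as fout:
    for line in fin:
        sample = json.loads(line)
        item = {
            "conversations": [
                {"role": "user", "content": sample["content"]},
                {"role": "assistant", "content": sample["summary"]},
            ]
        }
        fout.write(json.dumps(item, ensure_ascii=False) + "\n")
```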
### 模型下载
| Model | Seq Length | Download
| :---: |:---------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------:
...@@ -118,69 +109,74 @@ python ./scripts/format_tool_alpaca.py --path "train_data.json"
## 训练
### SFT微调
#### 单轮对话微调
```bash
cd ./finetune_demo
bash sft.sh
```
注意:请根据自己的需求配置其中的模型路径、数据集路径;batchsize、学习率等参数在 ./finetune_demo/configs/sft.yaml 中配置。
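可以用如下示意代码快速查看配置中的参数(假设已安装 pyyaml,字段名以实际配置文件为准):

```python
import yaml  # 需要已安装 pyyaml

# 快速查看 sft.yaml 中的训练参数(字段名以实际配置文件为准)
with open("./finetune_demo/configs/sft.yaml", encoding="utf-8") as f:
    config = yaml.safe_load(f)
print(config)
```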
#### 推理验证
对于输入输出格式的微调,可使用 `sft_inf.sh` 进行基本的推理验证。
在完成微调任务之后,可以看到 `output` 文件夹下多出了很多 `checkpoint-*` 文件夹,这些文件夹代表训练的轮数。我们选择最后一轮的微调权重,并在推理时导入。
注意:此时要将从 Hugging Face 下载的原生 `tokenizer_config.json` 和 `tokenization_chatglm.py` 两个文件放入待测的 checkpoint 文件夹下,比如 ./finetune_demo/output/checkpoint-3000/
```bash
cd ./finetune_demo
bash sft_inf.sh
```
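除了运行 `sft_inf.sh`,也可以用如下示意代码手动加载微调后的 checkpoint 做快速验证(路径与输入仅为示例,且假设该目录中已按上文放入 tokenizer 相关文件):

```python
from transformers import AutoTokenizer, AutoModel

# 示例路径,请替换为实际的 checkpoint 目录(需已包含 tokenizer 相关文件)
ckpt = "./finetune_demo/output/checkpoint-3000"
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)
model = AutoModel.from_pretrained(ckpt, trust_remote_code=True, device='cuda').eval()
# 用一条广告生成样例做快速验证(输入内容仅为示例)
response, _ = model.chat(tokenizer, "类型#裙*风格#简约", history=[])
print(response)
```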
### LORA微调
#### 单轮对话微调
```bash
cd ./finetune_demo
bash lora.sh
```
注意:请根据自己的需求配置其中的模型路径、数据集路径;batchsize、学习率等参数在 ./finetune_demo/configs/lora.yaml 中配置。
#### 推理验证
在完成微调任务之后,可以看到 `output` 文件夹下多出了很多 `checkpoint-*` 文件夹,这些文件夹代表训练的轮数。我们选择最后一轮的微调权重,并在推理时导入。
注意:经过LORA微调训练后的checkpoint无需复制原生GLM3的tokenizer文件到其目录下。
```bash
cd ./finetune_demo
bash lora_inf.sh
```
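如果希望在 Python 中直接加载 LoRA 微调结果,也可以参考如下示意代码(假设 LoRA 权重以 peft 格式保存,路径仅为示例):

```python
from transformers import AutoTokenizer
from peft import AutoPeftModelForCausalLM

# 示例路径,请替换为实际的 LoRA checkpoint 目录
ckpt = "./finetune_demo/output/checkpoint-3000"
# LoRA checkpoint 中记录了基座模型位置,AutoPeftModelForCausalLM 会加载基座权重并挂载 LoRA 权重
model = AutoPeftModelForCausalLM.from_pretrained(ckpt, trust_remote_code=True, device_map="auto").eval()
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
response, _ = model.chat(tokenizer, "类型#裙*风格#简约", history=[])
print(response)
```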
## Result
### SFT微调
#### 单轮对话微调推理结果
<div align="center">
<img src="./media/result1.jpg">
</div>
### LORA微调
#### 单轮对话微调推理结果
<div align="center">
<img src="./media/result2.jpg">
</div>
### 精度
......
...@@ -10,15 +10,36 @@
📍Experience the larger-scale ChatGLM model at <a href="https://www.chatglm.cn">chatglm.cn</a>
</p>
📔 About `ChatGLM3-6B`
For more detailed usage information, please refer to:
+ [ChatGLM3 technical documentation](https://lslfd0slxc.feishu.cn/wiki/WvQbwIJ9tiPAxGk8ywDck6yfnof?from=from_copylink)
+ [Bilibili video](https://www.bilibili.com/video/BV1uC4y1J7yA)
+ [YouTube video](https://www.youtube.com/watch?v=Pw9PB6R7ORA)
## GLM-4 Introduction
We have released the latest **GLM-4** model, which achieves new breakthroughs on multiple benchmarks. You can experience our latest model through the following channels.
+ [ChatGLM Qingyan](https://www.chatglm.cn): To experience the latest version of GLM-4, including **GLM, All Tools** and other functions, download the Zhipu Qingyan APP or use the [web page](https://www.chatglm.cn).
+ [API Platform](https://open.bigmodel.cn/): The new-generation API platform has been launched, where you can directly access the API and experience new models such as `GLM-4`, `GLM-3-Turbo`, `CharacterGLM-3`, and `CogView-3`. Among them, `GLM-4` and `GLM-3-Turbo` support new functions such as `system prompt`, `function call`, `retrieval`, and `Web_Search`. Welcome to try them out.
+ [GLM4 API Open Source Tutorial](https://github.com/MetaGLM/glm-cookbook/) - A tutorial and basic application guide for the GLM-4 API. You are invited to explore and experiment.
For API-related inquiries, refer to this open-source tutorial, or utilize the [GLM-4 API AI Assistant](https://open.bigmodel.cn/shareapp/v1/?share_code=sQwt5qyqYVaNh1O_87p8O) for assistance with common questions.
-----
## ChatGLM3 Introduction
**ChatGLM3** is a generation of pre-trained dialogue models jointly released by Zhipu AI and Tsinghua KEG. ChatGLM3-6B is the open-source model in the ChatGLM3 series, maintaining many excellent features of the first two generations such as smooth dialogue and low deployment threshold, while introducing the following features:
1. **Stronger Base Model:** The base model of ChatGLM3-6B, ChatGLM3-6B-Base, adopts a more diverse training dataset, more sufficient training steps, and a more reasonable training strategy. Evaluations on datasets from various perspectives such as semantics, mathematics, reasoning, code, and knowledge show that **ChatGLM3-6B-Base has the strongest performance among base models below 10B**.
2. **More Complete Function Support:** ChatGLM3-6B adopts a newly designed [Prompt format](PROMPT_en.md), supporting multi-turn dialogues as usual. It also natively supports [tool invocation](tools_using_demo/README_en.md) (Function Call), code execution (Code Interpreter), and Agent tasks in complex scenarios.
3. **More Comprehensive Open-source Series:** In addition to the dialogue model [ChatGLM3-6B](https://huggingface.co/THUDM/chatglm3-6b), the basic model [ChatGLM3-6B-Base](https://huggingface.co/THUDM/chatglm3-6b-base), the long-text dialogue model [ChatGLM3-6B-32K](https://huggingface.co/THUDM/chatglm3-6b-32k), and [ChatGLM3-6B-128K](https://huggingface.co/THUDM/chatglm3-6b-128k), which further strengthens long-text understanding, have also been open-sourced. All these weights are **fully open** for academic research, and **free commercial use is also allowed** after registration via a [questionnaire](https://open.bigmodel.cn/mla/form).
-----
...@@ -28,17 +49,31 @@ Although every effort has been made to ensure the compliance and accuracy of the
## Model List
| Model | Seq Length | Download
|:----------------:|:----------:|:-----------------------------------------------------------------------------------------------------------------------------------:
| ChatGLM3-6B | 8k | [HuggingFace](https://huggingface.co/THUDM/chatglm3-6b) \| [ModelScope](https://modelscope.cn/models/ZhipuAI/chatglm3-6b)
| ChatGLM3-6B-Base | 8k | [HuggingFace](https://huggingface.co/THUDM/chatglm3-6b-base) \| [ModelScope](https://modelscope.cn/models/ZhipuAI/chatglm3-6b-base)
| ChatGLM3-6B-32K | 32k | [HuggingFace](https://huggingface.co/THUDM/chatglm3-6b-32k) \| [ModelScope](https://modelscope.cn/models/ZhipuAI/chatglm3-6b-32k)
| ChatGLM3-6B-128K | 128k | [HuggingFace](https://huggingface.co/THUDM/chatglm3-6b-128k) \| [ModelScope](https://modelscope.cn/models/ZhipuAI/chatglm3-6b-128k)
## Projects
The following open-source repositories provide in-depth support for the ChatGLM3-6B model and are worth exploring:
Inference acceleration:
* [chatglm.cpp](https://github.com/li-plus/chatglm.cpp): Real-time inference on your laptop accelerated by quantization, similar to llama.cpp.
* [ChatGLM3-TPU](https://github.com/sophgo/ChatGLM3-TPU): Using the TPU accelerated inference solution, it runs about 7.5 token/s in real time on the end-side chip BM1684X (16T@FP16, 16G DDR).
* [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main): A high-performance GPU-accelerated inference solution developed by NVIDIA; you can refer to these [steps](./tensorrt_llm_demo/README.md) to deploy ChatGLM3.
* [OpenVINO](https://github.com/openvinotoolkit): A high-performance CPU- and GPU-accelerated inference solution developed by Intel; you can refer to this [step](./Intel_device_demo/openvino_demo/README.md) to deploy the ChatGLM3-6B model.
Efficient fine-tuning:
* [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory): An excellent, easy-to-use and efficient fine-tuning framework.
Application framework:
* [LangChain-Chatchat](https://github.com/chatchat-space/Langchain-Chatchat): An open-source, offline-deployable retrieval-augmented generation (RAG) knowledge-base project based on large language models such as ChatGLM and application frameworks such as LangChain.
* [BISHENG](https://github.com/dataelement/bisheng): An open-source platform for developing LLM applications. It empowers and accelerates the development of LLM applications and helps users enter the next generation of application development with the best experience.
## Evaluation Results
### Typical Tasks
...@@ -75,10 +110,7 @@ Then use pip to install the dependencies:
```
pip install -r requirements.txt
```
+ In order to ensure that the version of `torch` is correct, please strictly follow the instructions of the [official documentation](https://pytorch.org/get-started/locally/) for installation.
### Integrated Demo
...@@ -128,21 +160,21 @@ git clone https://huggingface.co/THUDM/chatglm3-6b
If the download from HuggingFace is slow, you can also download it from [ModelScope](https://modelscope.cn/models/ZhipuAI/chatglm3-6b).
# Model Fine-tuning
We provide a basic fine-tuning framework for ChatGLM3-6B. You can use it to fine-tune the model on your own dataset. For more details, please refer to the [Fine-tuning Demo](finetune_demo/README_en.md).
### Web-based Dialogue Demo
![web-demo](resources/web-demo.gif)
You can launch a web-based demo using Gradio with the following command:
```shell
python web_demo_gradio.py
```
![web-demo](resources/web-demo2.png)
You can launch a web-based demo using Streamlit with the following command:
```shell
streamlit run web_demo_streamlit.py
```
The web-based demo will run a Web Server and output an address. You can use it by opening the output address in a browser. Based on tests, the web-based demo using Streamlit runs more smoothly.
...@@ -159,19 +191,34 @@ python cli_demo.py
The program runs interactively in the command line: enter an instruction and press Enter to generate a response. Enter `clear` to clear the dialogue history, and `stop` to terminate the program.
### OpenAI API / Zhipu API Demo
We have launched open-source model API deployment code in OpenAI / ZhipuAI format, which can be used as the backend of any ChatGPT-based application.
Currently, you can deploy it by running [api_server.py](openai_api_demo/api_server.py) in the repository:
```shell
cd openai_api_demo
python api_server.py
```
We also provide sample code to test the performance of API calls:
+ OpenAI test script: [openai_api_request.py](openai_api_demo/openai_api_request.py)
+ ZhipuAI test script: [zhipu_api_request.py](openai_api_demo/zhipu_api_request.py)
+ Test with Curl
+ chat Curl test
```shell
curl -X POST "http://127.0.0.1:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
-d "{\"model\": \"chatglm3-6b\", \"messages\": [{\"role\": \"system\", \"content\": \"You are ChatGLM3, a large language model trained by Zhipu.AI. Follow the user's instructions carefully. Respond using markdown.\"}, {\"role\": \"user\", \"content\": \"你好,给我讲一个故事,大概100字\"}], \"stream\": false, \"max_tokens\": 100, \"temperature\": 0.8, \"top_p\": 0.8}"
```
+ agent-chat Curl test
```shell
curl -X POST "http://127.0.0.1:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
-d "{\"model\": \"chatglm3-6b\", \"agent\": true, \"messages\": [{\"role\": \"user\", \"content\": \"37乘以8加7除2等于多少?\"}], \"stream\": true, \"max_tokens\": 100, \"temperature\": 0.8, \"top_p\": 0.8}"
```
+ Testing with Python
```shell
cd openai_api_demo
python openai_api_request.py
```
...@@ -181,7 +228,7 @@ If the test is successful, the model should return a story.
### Tool Invocation
For methods of tool invocation, please refer to [Tool Invocation](tools_using_demo/README_en.md).
## Low-Cost Deployment
...@@ -217,15 +264,18 @@ Loading the half-precision ChatGLM3-6B model requires about 13GB of memory. Machines
### Multi-GPU Deployment
If you have multiple GPUs, but each GPU's VRAM size is not enough to accommodate the complete model, then the model can be split across multiple GPUs. First, install accelerate: `pip install accelerate`, and then load the model as usual.
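For example, a minimal sketch (assuming `accelerate` is installed) that lets `transformers` shard the weights across the visible GPUs automatically:

```python
from transformers import AutoTokenizer, AutoModel

# With accelerate installed, device_map="auto" splits the weights across all visible GPUs.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True, device_map="auto").eval()
```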
### OpenVINO Demo
ChatGLM3-6B already supports accelerated inference with the OpenVINO toolkit, which brings a larger inference speed-up on Intel CPU and GPU devices. For specific usage, please refer to the [OpenVINO Demo](Intel_device_demo/openvino_demo/README.md).
### TensorRT-LLM Demo
ChatGLM3-6B now supports accelerated inference using the TensorRT-LLM toolkit, significantly improving model inference speed. For specific usage, please refer to the [TensorRT-LLM Demo](tensorrt_llm_demo/tensorrt_llm_cli_demo.py) and the official technical documentation.
## Citation
......
# ChatGLM3
<p align="center">
🤗 <a href="https://huggingface.co/THUDM/chatglm3-6b" target="_blank">HF Repo</a> • 🤖 <a href="https://modelscope.cn/models/ZhipuAI/chatglm3-6b" target="_blank">ModelScope</a> • 🐦 <a href="https://twitter.com/thukeg" target="_blank">Twitter</a> • 📃 <a href="https://arxiv.org/abs/2103.10360" target="_blank">[GLM@ACL 22]</a> <a href="https://github.com/THUDM/GLM" target="_blank">[GitHub]</a> • 📃 <a href="https://arxiv.org/abs/2210.02414" target="_blank">[GLM-130B@ICLR 23]</a> <a href="https://github.com/THUDM/GLM-130B" target="_blank">[GitHub]</a> <br>
</p>
<p align="center">
👋 加入我们的 <a href="https://join.slack.com/t/chatglm/shared_invite/zt-25ti5uohv-A_hs~am_D3Q8XPZMpj7wwQ" target="_blank">Slack</a><a href="resources/WECHAT.md" target="_blank">微信</a>
</p>
<p align="center">
📍在 <a href="https://www.chatglm.cn">chatglm.cn</a> 体验更大规模的 ChatGLM 模型。
</p>
[Read this in English.](./README_en.md)
📔 更为详细的使用信息,可以参考:[ChatGLM3技术文档](https://lslfd0slxc.feishu.cn/wiki/WvQbwIJ9tiPAxGk8ywDck6yfnof?from=from_copylink)
## 介绍
ChatGLM3 是智谱AI和清华大学 KEG 实验室联合发布的新一代对话预训练模型。ChatGLM3-6B 是 ChatGLM3 系列中的开源模型,在保留了前两代模型对话流畅、部署门槛低等众多优秀特性的基础上,ChatGLM3-6B 引入了如下特性:
1. **更强大的基础模型:** ChatGLM3-6B 的基础模型 ChatGLM3-6B-Base 采用了更多样的训练数据、更充分的训练步数和更合理的训练策略。在语义、数学、推理、代码、知识等不同角度的数据集上测评显示,**ChatGLM3-6B-Base 具有在 10B 以下的基础模型中最强的性能**
2. **更完整的功能支持:** ChatGLM3-6B 采用了全新设计的 [Prompt 格式](PROMPT.md),除正常的多轮对话外,还原生支持[工具调用](tool_using/README.md)(Function Call)、代码执行(Code Interpreter)和 Agent 任务等复杂场景。
3. **更全面的开源序列:** 除了对话模型 [ChatGLM3-6B](https://huggingface.co/THUDM/chatglm3-6b) 外,还开源了基础模型 [ChatGLM3-6B-Base](https://huggingface.co/THUDM/chatglm3-6b-base)、长文本对话模型 [ChatGLM3-6B-32K](https://huggingface.co/THUDM/chatglm3-6b-32k)。以上所有权重对学术研究**完全开放**,在填写[问卷](https://open.bigmodel.cn/mla/form)进行登记后**亦允许免费商业使用**
-----
ChatGLM3 开源模型旨在与开源社区一起推动大模型技术发展,恳请开发者和大家遵守[开源协议](MODEL_LICENSE),勿将开源模型和代码及基于开源项目产生的衍生物用于任何可能给国家和社会带来危害的用途以及用于任何未经过安全评估和备案的服务。目前,本项目团队未基于 **ChatGLM3 开源模型**开发任何应用,包括网页端、安卓、苹果 iOS 及 Windows App 等应用。
尽管模型在训练的各个阶段都尽力确保数据的合规性和准确性,但由于 ChatGLM3-6B 模型规模较小,且模型受概率随机性因素影响,无法保证输出内容的准确。同时模型的输出容易被用户的输入误导。**本项目不承担开源模型和代码导致的数据安全、舆情风险或发生任何模型被误导、滥用、传播、不当利用而产生的风险和责任。**
## 模型列表
| Model | Seq Length | Download
| :---: |:---------------------------:|:-----------------------------------------------------------------------------------------------------------------------------------:
| ChatGLM3-6B | 8k | [HuggingFace](https://huggingface.co/THUDM/chatglm3-6b) \| [ModelScope](https://modelscope.cn/models/ZhipuAI/chatglm3-6b)
| ChatGLM3-6B-Base | 8k | [HuggingFace](https://huggingface.co/THUDM/chatglm3-6b-base) \| [ModelScope](https://modelscope.cn/models/ZhipuAI/chatglm3-6b-base)
| ChatGLM3-6B-32K | 32k | [HuggingFace](https://huggingface.co/THUDM/chatglm3-6b-32k) \| [ModelScope](https://modelscope.cn/models/ZhipuAI/chatglm3-6b-32k)
## 友情链接
对 ChatGLM3 进行加速的开源项目:
* [chatglm.cpp](https://github.com/li-plus/chatglm.cpp): 类似 llama.cpp 的量化加速推理方案,实现笔记本上实时对话
* [ChatGLM3-TPU](https://github.com/sophgo/ChatGLM3-TPU): 采用TPU加速推理方案,在算能端侧芯片BM1684X(16T@FP16,内存16G)上实时运行约7.5 token/s
## 评测结果
### 典型任务
我们选取了 8 个中英文典型数据集,在 ChatGLM3-6B (base) 版本上进行了性能测试。
| Model | GSM8K | MATH | BBH | MMLU | C-Eval | CMMLU | MBPP | AGIEval |
|------------------|:-----:|:----:|:----:|:----:|:------:|:-----:|:----:|:-------:|
| ChatGLM2-6B-Base | 32.4 | 6.5 | 33.7 | 47.9 | 51.7 | 50.0 | - | - |
| Best Baseline    | 52.1  | 13.1 | 45.0 | 60.1 | 63.5   | 62.2  | 47.5 | 45.8    |
| ChatGLM3-6B-Base | 72.3 | 25.7 | 66.1 | 61.4 | 69.0 | 67.5 | 52.4 | 53.7 |
> Best Baseline 指的是截止 2023年10月27日、模型参数在 10B 以下、在对应数据集上表现最好的预训练模型,不包括只针对某一项任务训练而未保持通用能力的模型。
> 对 ChatGLM3-6B-Base 的测试中,BBH 采用 3-shot 测试,需要推理的 GSM8K、MATH 采用 0-shot CoT 测试,MBPP 采用 0-shot 生成后运行测例计算 Pass@1 ,其他选择题类型数据集均采用 0-shot 测试。
我们在多个长文本应用场景下对 ChatGLM3-6B-32K 进行了人工评估测试。与二代模型相比,其效果平均提升了超过 50%。在论文阅读、文档摘要和财报分析等应用中,这种提升尤为显著。此外,我们还在 LongBench 评测集上对模型进行了测试,具体结果如下表所示
| Model | 平均 | Summary | Single-Doc QA | Multi-Doc QA | Code | Few-shot | Synthetic |
|----------------------|:-----:|:----:|:----:|:----:|:------:|:-----:|:-----:|
| ChatGLM2-6B-32K | 41.5 | 24.8 | 37.6 | 34.7 | 52.8 | 51.3 | 47.7 |
| ChatGLM3-6B-32K | 50.2 | 26.6 | 45.8 | 46.1 | 56.2 | 61.2 | 65 |
## 使用方式
### 环境安装
首先需要下载本仓库:
```shell
git clone https://github.com/THUDM/ChatGLM3
cd ChatGLM3
```
然后使用 pip 安装依赖:
```
pip install -r requirements.txt
```
+ `transformers` 库版本应为 `4.30.2` 及以上,`torch` 库版本应为 2.0 及以上,以获得最佳的推理性能。
+ 为了保证 `torch` 的版本正确,请严格按照 [官方文档](https://pytorch.org/get-started/locally/) 的说明安装。
+ `gradio` 库版本应该为 `3.x` 的版本。
### 综合 Demo
我们提供了一个集成以下三种功能的综合 Demo,运行方法请参考 [综合 Demo](composite_demo/README.md)
- Chat: 对话模式,在此模式下可以与模型进行对话。
- Tool: 工具模式,模型除了对话外,还可以通过工具进行其他操作。
<img src="resources/tool.png" width="400">
- Code Interpreter: 代码解释器模式,模型可以在一个 Jupyter 环境中执行代码并获取结果,以完成复杂任务。
<img src="resources/heart.png" width="400">
### 代码调用
可以通过如下代码调用 ChatGLM 模型来生成对话:
```python
>>> from transformers import AutoTokenizer, AutoModel
>>> tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
>>> model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True, device='cuda')
>>> model = model.eval()
>>> response, history = model.chat(tokenizer, "你好", history=[])
>>> print(response)
你好👋!我是人工智能助手 ChatGLM3-6B,很高兴见到你,欢迎问我任何问题
>>> response, history = model.chat(tokenizer, "晚上睡不着应该怎么办", history=history)
>>> print(response)
晚上睡不着可能会让你感到焦虑或不舒服,但以下是一些可以帮助你入睡的方法:
1. 制定规律的睡眠时间表:保持规律的睡眠时间表可以帮助你建立健康的睡眠习惯,使你更容易入睡尽量在每天的相同时间上床,并在同一时间起床
2. 创造一个舒适的睡眠环境:确保睡眠环境舒适,安静,黑暗且温度适宜可以使用舒适的床上用品,并保持房间通风
3. 放松身心:在睡前做些放松的活动,例如泡个热水澡,听些轻柔的音乐,阅读一些有趣的书籍等,有助于缓解紧张和焦虑,使你更容易入睡
4. 避免饮用含有咖啡因的饮料:咖啡因是一种刺激性物质,会影响你的睡眠质量尽量避免在睡前饮用含有咖啡因的饮料,例如咖啡,茶和可乐
5. 避免在床上做与睡眠无关的事情:在床上做些与睡眠无关的事情,例如看电影,玩游戏或工作等,可能会干扰你的睡眠
6. 尝试呼吸技巧:深呼吸是一种放松技巧,可以帮助你缓解紧张和焦虑,使你更容易入睡试着慢慢吸气,保持几秒钟,然后缓慢呼气
如果这些方法无法帮助你入睡,你可以考虑咨询医生或睡眠专家,寻求进一步的建议
```
#### 从本地加载模型
以上代码会由 `transformers` 自动下载模型实现和参数。完整的模型实现在 [Hugging Face Hub](https://huggingface.co/THUDM/chatglm3-6b)。如果你的网络环境较差,下载模型参数可能会花费较长时间甚至失败。此时可以先将模型下载到本地,然后从本地加载。
从 Hugging Face Hub 下载模型需要先[安装Git LFS](https://docs.github.com/zh/repositories/working-with-files/managing-large-files/installing-git-large-file-storage),然后运行
```Shell
git clone https://huggingface.co/THUDM/chatglm3-6b
```
如果你从 HuggingFace 下载比较慢,也可以从 [ModelScope](https://modelscope.cn/models/ZhipuAI/chatglm3-6b) 中下载。
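下载完成后,将模型加载路径替换为本地目录即可,例如(其中 ./chatglm3-6b 为示例路径,请替换为实际下载位置):

```python
from transformers import AutoTokenizer, AutoModel

# 将路径替换为模型实际下载到的本地目录
local_path = "./chatglm3-6b"
tokenizer = AutoTokenizer.from_pretrained(local_path, trust_remote_code=True)
model = AutoModel.from_pretrained(local_path, trust_remote_code=True, device='cuda').eval()
```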
### 模型微调
请参考对话模型微调 [ChatGLM3-6B 微调示例](finetune_chatmodel_demo/README.md),或基座模型微调 [ChatGLM3-6B-base 微调示例](finetune_basemodel_demo/README.md)
请注意,不同的微调脚本对应的模型并不相同,请根据需要选择对应的模型。
### 网页版对话 Demo
![web-demo](resources/web-demo.gif)
可以通过以下命令启动基于 Gradio 的网页版 demo:
```shell
python web_demo.py
```
![web-demo](resources/web-demo2.png)
可以通过以下命令启动基于 Streamlit 的网页版 demo:
```shell
streamlit run web_demo2.py
```
网页版 demo 会运行一个 Web Server,并输出地址。在浏览器中打开输出的地址即可使用。 经测试,基于 Streamlit 的网页版 Demo 会更流畅。
### 命令行对话 Demo
![cli-demo](resources/cli-demo.png)
运行仓库中 [cli_demo.py](basic_demo/cli_demo.py)
```shell
python cli_demo.py
```
程序会在命令行中进行交互式的对话,在命令行中输入指示并回车即可生成回复,输入 `clear` 可以清空对话历史,输入 `stop` 终止程序。
### LangChain Demo
请参考 [基于 LangChain 的工具调用 Demo](langchain_demo/README.md)
### 工具调用
关于工具调用的方法请参考 [工具调用](tool_using/README.md)
### API 部署
感谢 [@xusenlinzy](https://github.com/xusenlinzy) 实现了 OpenAI 格式的流式 API 部署,可以作为任意基于 ChatGPT 的应用的后端,比如 [ChatGPT-Next-Web](https://github.com/Yidadaa/ChatGPT-Next-Web)。可以通过运行仓库中的[openai_api.py](openai_api_demo/openai_api.py) 进行部署:
```shell
cd openai_api_demo
python openai_api.py
```
同时,我们也编写了示例代码,用来测试 API 调用的性能。可以通过运行仓库中的 [openai_api_request.py](openai_api_demo/openai_api_request.py) 进行测试:
+ 使用Curl进行测试
```shell
curl -X POST "http://127.0.0.1:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
-d "{\"model\": \"chatglm3-6b\", \"messages\": [{\"role\": \"system\", \"content\": \"You are ChatGLM3, a large language model trained by Zhipu.AI. Follow the user's instructions carefully. Respond using markdown.\"}, {\"role\": \"user\", \"content\": \"你好,给我讲一个故事,大概100字\"}], \"stream\": false, \"max_tokens\": 100, \"temperature\": 0.8, \"top_p\": 0.8}"
````
+ 使用Python进行测试
```shell
cd openai_api_demo
python openai_api_request.py
```
如果测试成功,则模型应该返回一段故事。
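如果希望在自己的程序中调用该接口,也可以参考如下示意代码(假设已安装 1.x 版本的 openai Python 包,且服务运行在本地 8000 端口,地址与模型名请按实际部署修改):

```python
from openai import OpenAI

# base_url 指向本地部署的 OpenAI 格式接口;本地部署时 api_key 通常可以任意填写
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="chatglm3-6b",
    messages=[{"role": "user", "content": "你好,给我讲一个故事,大概100字"}],
    max_tokens=100,
    temperature=0.8,
)
print(response.choices[0].message.content)
```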
## 低成本部署
### 模型量化
默认情况下,模型以 FP16 精度加载,运行上述代码需要大概 13GB 显存。如果你的 GPU 显存有限,可以尝试以量化方式加载模型,使用方法如下:
```python
model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).quantize(4).cuda()
```
模型量化会带来一定的性能损失,经过测试,ChatGLM3-6B 在 4-bit 量化下仍然能够进行自然流畅的生成。
### CPU 部署
如果你没有 GPU 硬件的话,也可以在 CPU 上进行推理,但是推理速度会更慢。使用方法如下(需要大概 32GB 内存)
```python
model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).float()
```
### Mac 部署
对于搭载了 Apple Silicon 或者 AMD GPU 的 Mac,可以使用 MPS 后端来在 GPU 上运行 ChatGLM3-6B。需要参考 Apple 的 [官方说明](https://developer.apple.com/metal/pytorch) 安装 PyTorch-Nightly(正确的版本号应该是2.x.x.dev2023xxxx,而不是 2.x.x)。
目前在 MacOS 上只支持[从本地加载模型](README.md#从本地加载模型)。将代码中的模型加载改为从本地加载,并使用 mps 后端:
```python
model = AutoModel.from_pretrained("your local path", trust_remote_code=True).to('mps')
```
加载半精度的 ChatGLM3-6B 模型需要大概 13GB 内存。内存较小的机器(比如 16GB 内存的 MacBook Pro),在空余内存不足的情况下会使用硬盘上的虚拟内存,导致推理速度严重变慢。
### 多卡部署
如果你有多张 GPU,但是每张 GPU 的显存大小都不足以容纳完整的模型,那么可以将模型切分在多张GPU上。首先安装 accelerate: `pip install accelerate`,然后通过如下方法加载模型:
```python
from utils import load_model_on_gpus
model = load_model_on_gpus("THUDM/chatglm3-6b", num_gpus=2)
```
即可将模型部署到两张 GPU 上进行推理。你可以将 `num_gpus` 改为你希望使用的 GPU 数。默认是均匀切分的,你也可以传入 `device_map` 参数来自己指定。
## 引用
如果你觉得我们的工作有帮助的话,请考虑引用下列论文。
```
@article{zeng2022glm,
title={Glm-130b: An open bilingual pre-trained model},
author={Zeng, Aohan and Liu, Xiao and Du, Zhengxiao and Wang, Zihan and Lai, Hanyu and Ding, Ming and Yang, Zhuoyi and Xu, Yifan and Zheng, Wendi and Xia, Xiao and others},
journal={arXiv preprint arXiv:2210.02414},
year={2022}
}
```
```
@inproceedings{du2022glm,
title={GLM: General Language Model Pretraining with Autoregressive Blank Infilling},
author={Du, Zhengxiao and Qian, Yujie and Liu, Xiao and Ding, Ming and Qiu, Jiezhong and Yang, Zhilin and Tang, Jie},
booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
pages={320--335},
year={2022}
}
```