Update tools_using_demo/cli_demo_tool.py, tools_using_demo/openai_api_demo.py,...

Update tools_using_demo/cli_demo_tool.py, tools_using_demo/openai_api_demo.py, tools_using_demo/README.md, tools_using_demo/README_en.md, tools_using_demo/tool_register.py, tensorrt_llm_demo/README.md, tensorrt_llm_demo/tensorrt_llm_cli_demo.py, resources/cli-demo.png, resources/web-demo2.png, resources/tool_en.png, resources/tool.png, resources/heart.png, resources/wechat.jpg, resources/web-demo.gif, resources/web-demo2.gif, resources/WECHAT.md, resources/code_en.gif, openai_api_demo/api_server.py, openai_api_demo/.env, openai_api_demo/openai_api_request.py, openai_api_demo/docker-compose.yml, openai_api_demo/utils.py, openai_api_demo/zhipu_api_request.py, openai_api_demo/langchain_openai_api.py, langchain_demo/ChatGLM3.py, langchain_demo/main.py, langchain_demo/tools/Calculator.py, langchain_demo/tools/DistanceConversion.py, langchain_demo/tools/Weather.py, Intel_device_demo/README.md, Intel_device_demo/ipex_llm_cpu_demo/api_server.py, Intel_device_demo/ipex_llm_cpu_demo/chatglm3_infer.py, Intel_device_demo/ipex_llm_cpu_demo/chatglm3_web_demo.py, Intel_device_demo/ipex_llm_cpu_demo/openai_api_request.py, Intel_device_demo/ipex_llm_cpu_demo/generate.py, Intel_device_demo/ipex_llm_cpu_demo/utils.py, Intel_device_demo/openvino_demo/openvino_cli_demo.py, Intel_device_demo/openvino_demo/README.md, finetune_demo/lora_finetune.ipynb, finetune_demo/finetune_hf.py, finetune_demo/inference_hf.py, finetune_demo/README.md, finetune_demo/README_en.md, finetune_demo/requirements.txt, finetune_demo/configs/ds_zero_3.json, finetune_demo/configs/ds_zero_2.json, finetune_demo/configs/ptuning_v2.yaml, finetune_demo/configs/lora.yaml, finetune_demo/configs/sft.yaml, composite_demo/assets/emojis.png, composite_demo/assets/demo.png, composite_demo/assets/heart.png, composite_demo/assets/tool.png, composite_demo/.streamlit/config.toml, composite_demo/client.py, composite_demo/conversation.py, composite_demo/README_en.md, composite_demo/main.py, composite_demo/demo_chat.py, 
composite_demo/README.md, composite_demo/requirements.txt, composite_demo/demo_tool.py, composite_demo/tool_registry.py, composite_demo/demo_ci.py, basic_demo/cli_demo_bad_word_ids.py, basic_demo/cli_demo.py, basic_demo/cli_batch_request_demo.py, basic_demo/web_demo_gradio.py, basic_demo/web_demo_streamlit.py, .github/ISSUE_TEMPLATE/bug_report.yaml, .github/ISSUE_TEMPLATE/feature-request.yaml, .github/PULL_REQUEST_TEMPLATE/pr_template.md, MODEL_LICENSE, .gitignore, DEPLOYMENT.md, DEPLOYMENT_en.md, LICENSE, PROMPT.md, README_en.md, requirements.txt, README.md, PROMPT_en.md, update_requirements.sh files

Commit 4bd96acc authored May 10, 2024 by lvzhen
20 changed files
--- a/.github/ISSUE_TEMPLATE/bug_report.yaml
+++ b/.github/ISSUE_TEMPLATE/bug_report.yaml
+name: "\U0001F41B Bug Report"
+description: Submit a bug report to help us improve ChatGLM3 / 提交一个 Bug 问题报告来帮助我们改进 ChatGLM3
+body:
+  - type: textarea
+    id: system-info
+    attributes:
+      label: System Info / 系统信息
+      description: Your operating environment / 您的运行环境信息
+      placeholder: Includes Cuda version, Transformers version, Python version, operating system, hardware information (if you suspect a hardware problem)... / 包括Cuda版本，Transformers版本，Python版本，操作系统，硬件信息(如果您怀疑是硬件方面的问题)...
+    validations:
+      required: true
+
+  - type: textarea
+    id: who-can-help
+    attributes:
+      label: Who can help? / 谁可以帮助到您？
+      description: |
+        Your issue will be replied to more quickly if you can figure out the right person to tag with @
+        All issues are read by one of the maintainers, so if you don't know who to tag, just leave this blank and our maintainer will ping the right person.
+    
+        Please tag fewer than 3 people.
+        
+        如果您能找到合适的标签 @，您的问题会更快得到回复。
+        所有问题都会由我们的维护者阅读，如果您不知道该标记谁，只需留空，我们的维护人员会找到合适的开发组成员来解决问题。
+        
+        标记的人数应该不超过 3 个人。
+
+        Related demo leader / 相关demo负责人 :
+        - finetune_demo: @Btlmd
+        - langchain_demo: @yincf
+        - composite_demo: @abmfy
+        
+        If it's not a bug in these three subsections, you may not specify the helper. Our maintainer will find the right person in the development group to solve the problem.
+        
+        如果不是这三个子版块的bug，您可以不指明帮助者，我们的维护人员会找到合适的开发组成员来解决问题。
+
+      placeholder: "@Username ..."
+
+  - type: checkboxes
+    id: information-scripts-examples
+    attributes:
+      label: Information / 问题信息
+      description: 'The problem arises when using: / 问题出现在'
+      options:
+        - label: "The official example scripts / 官方的示例脚本"
+        - label: "My own modified scripts / 我自己修改的脚本和任务"
+
+  - type: textarea
+    id: reproduction
+    validations:
+      required: true
+    attributes:
+      label: Reproduction / 复现过程
+      description: |
+        Please provide a code example that reproduces the problem you encountered, preferably with a minimal reproduction unit.
+        If you have code snippets, error messages, stack traces, please provide them here as well.
+        Please format your code correctly using code tags. See https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
+        Do not use screenshots, as they are difficult to read and (more importantly) do not allow others to copy and paste your code.
+        
+        请提供能重现您遇到的问题的代码示例,最好是最小复现单元。
+        如果您有代码片段、错误信息、堆栈跟踪，也请在此提供。
+        请使用代码标签正确格式化您的代码。请参见 https://help.github.com/en/github/writing-on-github/creating-and-highlighting-code-blocks#syntax-highlighting
+        请勿使用截图，因为截图难以阅读，而且（更重要的是）不允许他人复制粘贴您的代码。
+      placeholder: |
+        Steps to reproduce the behavior/复现Bug的步骤:
+          
+          1.
+          2.
+          3.
+
+  - type: textarea
+    id: expected-behavior
+    validations:
+      required: true
+    attributes:
+      label: Expected behavior / 期待表现
+      description: "A clear and concise description of what you would expect to happen. /简单描述您期望发生的事情。"
\ No newline at end of file
--- a/.github/ISSUE_TEMPLATE/feature-request.yaml
+++ b/.github/ISSUE_TEMPLATE/feature-request.yaml
+name: "\U0001F680 Feature request"
+description: Submit a request for a new ChatGLM3 feature / 提交一个新的 ChatGLM3 的功能建议
+labels: [ "feature" ]
+body:
+  - type: textarea
+    id: feature-request
+    validations:
+      required: true
+    attributes:
+      label: Feature request  / 功能建议
+      description: |
+        A brief description of the functional proposal. Links to corresponding papers and code are desirable.
+        对功能建议的简述。最好提供对应的论文和代码链接
+
+  - type: textarea
+    id: motivation
+    validations:
+      required: true
+    attributes:
+      label: Motivation / 动机
+      description: |
+        Your motivation for making the suggestion. If that motivation is related to another GitHub issue, link to it here.
+        您提出建议的动机。如果该动机与另一个 GitHub 问题有关，请在此处提供对应的链接。
+
+  - type: textarea
+    id: contribution
+    validations:
+      required: true
+    attributes:
+      label: Your contribution / 您的贡献
+      description: |
+        
+        Your PR link or any other link you can help with.
+        您的PR链接或者其他您能提供帮助的链接。
\ No newline at end of file
--- a/.github/PULL_REQUEST_TEMPLATE/pr_template.md
+++ b/.github/PULL_REQUEST_TEMPLATE/pr_template.md
+#  Raise valuable PR / 提出有价值的PR
+
+## Caution/ 注意事项:
+Users should keep the following points in mind when submitting PRs:
+
+1. The proposed PR should be about this project. 
+2. The proposed PR should be focused; if there are multiple ideas and optimizations, they should be split into separate PRs.
+
+用户在提交PR时候应该注意以下几点:
+
+1. 提出的PR应该是关于本项目的。
+2. 提出的PR应该具有针对性，如果具有多个不同的想法和优化方案，应该分配到不同的PR中。
+
+## 不应该提出的PR / PRs that should not be proposed
+
+If a developer proposes a PR that falls into any of the following categories, it may be closed or rejected.
+
+1. PRs that don't describe the proposed improvement.
+2. Multiple issues of different types combined in one PR.
+3. The proposed PR is highly duplicative of already existing PRs.
+
+如果开发者提出关于以下方面的PR，则可能会被直接关闭或拒绝通过。
+
+1. 没有说明改进方案的。
+2. 多个不同类型的问题合并在一个PR中的。
+3. 提出的PR与已经存在的PR高度重复的。
+
+
+# Check your PR / 检查您的PR
+- [ ] Have you read the Contributor Guidelines, Pull Request section? / 您是否阅读了贡献者指南、Pull Request 部分？
+- [ ] Has this been discussed/approved via a Github issue or forum? If so, add a link. / 是否通过 Github 问题或论坛讨论/批准过？如果是，请添加链接。
+- [ ] Did you make sure you updated the documentation with your changes? Here are the Documentation Guidelines, and here are the Documentation Formatting Tips. /您是否确保根据您的更改更新了文档？这里是文档指南，这里是文档格式化技巧。
+- [ ] Did you write new required tests? / 您是否编写了新的必要测试？
+- [ ] Is your PR for only one issue? / 您的PR是否仅针对一个问题
\ No newline at end of file
--- a/.gitignore
+++ b/.gitignore
+__pycache__
+
+# finetune_demo: generated & downloaded files
+finetune_demo/output
+finetune_demo/data
+finetune_demo/formatted_data
+ToolAlpaca/
+AdvertiseGen/
+*.gz
+*.idea
+.DS_Store
\ No newline at end of file
--- a/DEPLOYMENT.md
+++ b/DEPLOYMENT.md
+## 低成本部署
+
+### 模型量化
+
+默认情况下，模型以 FP16 精度加载，运行上述代码需要大概 13GB 显存。如果你的 GPU 显存有限，可以尝试以量化方式加载模型，使用方法如下：
+
+```python
+model = AutoModel.from_pretrained("THUDM/chatglm3-6b",trust_remote_code=True).quantize(4).cuda()
+```
+
+模型量化会带来一定的性能损失，经过测试，ChatGLM3-6B 在 4-bit 量化下仍然能够进行自然流畅的生成。
+
+### CPU 部署
+
+如果你没有 GPU 硬件的话，也可以在 CPU 上进行推理，但是推理速度会更慢。使用方法如下（需要大概 32GB 内存）
+```python
+model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).float()
+```
+
+### Mac 部署
+
+对于搭载了 Apple Silicon 或者 AMD GPU 的 Mac，可以使用 MPS 后端来在 GPU 上运行 ChatGLM3-6B。需要参考 Apple 的 [官方说明](https://developer.apple.com/metal/pytorch) 安装 PyTorch-Nightly（正确的版本号应该是2.x.x.dev2023xxxx，而不是 2.x.x）。
+
+目前在 MacOS 上只支持[从本地加载模型](README.md#从本地加载模型)。将代码中的模型加载改为从本地加载，并使用 mps 后端：
+```python
+model = AutoModel.from_pretrained("your local path", trust_remote_code=True).to('mps')
+```
+
+加载半精度的 ChatGLM3-6B 模型需要大概 13GB 内存。内存较小的机器（比如 16GB 内存的 MacBook Pro），在空余内存不足的情况下会使用硬盘上的虚拟内存，导致推理速度严重变慢。
+
+### 多卡部署
+如果你有多张 GPU，但是每张 GPU 的显存大小都不足以容纳完整的模型，那么可以将模型切分在多张GPU上。首先安装 accelerate: `pip install accelerate`，然后通过如下方法加载模型：
+
+```python
+from utils import load_model_on_gpus
+
+model = load_model_on_gpus("THUDM/chatglm3-6b", num_gpus=2)
+```
+即可将模型部署到两张 GPU 上进行推理。你可以将 `num_gpus` 改为你希望使用的 GPU 数。默认是均匀切分的，你也可以传入 `device_map` 参数来自己指定。 
\ No newline at end of file
--- a/DEPLOYMENT_en.md
+++ b/DEPLOYMENT_en.md
+## Low-Cost Deployment
+
+### Model Quantization
+
+By default, the model is loaded in FP16 precision, and running the above code requires about 13GB of VRAM. If your GPU's VRAM is limited, you can try loading the model in quantized form, as follows:
+
+```python
+model = AutoModel.from_pretrained("THUDM/chatglm3-6b",trust_remote_code=True).quantize(4).cuda()
+```
+
+Model quantization causes some loss in performance. In testing, ChatGLM3-6B can still generate natural, fluent text under 4-bit quantization.
+
+### CPU Deployment
+
+If you don't have GPU hardware, you can also run inference on the CPU, but the inference speed will be slower. The usage is as follows (requires about 32GB of memory):
+
+```python
+model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).float()
+```
+
+### Mac Deployment
+
+For Macs equipped with Apple Silicon or AMD GPUs, the MPS backend can be used to run ChatGLM3-6B on the GPU. Refer to Apple's [official instructions](https://developer.apple.com/metal/pytorch) to install PyTorch-Nightly (the correct version number should be 2.x.x.dev2023xxxx, not 2.x.x).
+
+Currently, only [loading the model locally](README_en.md#load-model-locally) is supported on MacOS. Change the model loading in the code to load locally and use the MPS backend:
+
+```python
+model = AutoModel.from_pretrained("your local path", trust_remote_code=True).to('mps')
+```
+
+Loading the half-precision ChatGLM3-6B model requires about 13GB of memory. Machines with smaller memory (such as a 16GB memory MacBook Pro) will use virtual memory on the hard disk when there is insufficient free memory, resulting in a significant slowdown in inference speed.
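The memory figures quoted in this file follow from simple arithmetic on parameter count and bytes per weight. A minimal sketch, assuming roughly 6.2 billion parameters for ChatGLM3-6B (an approximation that this diff does not state) and counting weights only, with no activations or KV cache:

```python
# Rough weight-memory estimate per precision.
# PARAM_COUNT is an assumed, approximate figure; activations and the
# KV cache are NOT included in the result.
PARAM_COUNT = 6.2e9

BYTES_PER_WEIGHT = {
    "fp32": 4.0,  # .float(), as used for CPU inference
    "fp16": 2.0,  # default load precision
    "int4": 0.5,  # .quantize(4)
}


def weight_memory_gib(precision: str, param_count: float = PARAM_COUNT) -> float:
    """Return the approximate size of the weights in GiB."""
    return param_count * BYTES_PER_WEIGHT[precision] / 1024**3


if __name__ == "__main__":
    for name in BYTES_PER_WEIGHT:
        print(f"{name}: ~{weight_memory_gib(name):.1f} GiB of weights")
```

The 13GB and 32GB figures quoted above are higher than these weight-only estimates, presumably because of runtime overhead on top of the weights.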
+
+### Multi-GPU Deployment
+
+If you have multiple GPUs, but no single GPU has enough VRAM to hold the complete model, you can split the model across several GPUs. First, install accelerate with `pip install accelerate`, then load the model as follows:
+
+```python
+from utils import load_model_on_gpus
+
+model = load_model_on_gpus("THUDM/chatglm3-6b", num_gpus=2)
+```
+
+This allows the model to be deployed on two GPUs for inference. You can change `num_gpus` to the number of GPUs you want to use. It is evenly split by default, but you can also pass the `device_map` parameter to specify it yourself.
\ No newline at end of file
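To make "evenly split" concrete, here is a minimal sketch of an even layer-to-GPU assignment. The function name, the `layers.N` module names, and the layer count are hypothetical placeholders; they are not the names `load_model_on_gpus` uses, and its actual splitting logic is not shown in this diff.

```python
from typing import Dict


def even_device_map(num_layers: int, num_gpus: int, prefix: str = "layers") -> Dict[str, int]:
    """Assign layers to GPUs in contiguous chunks of at most ceil(n / g) layers.

    Hypothetical illustration only; module names are placeholders.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    per_gpu = -(-num_layers // num_gpus)  # ceiling division
    return {f"{prefix}.{i}": i // per_gpu for i in range(num_layers)}


if __name__ == "__main__":
    device_map = even_device_map(num_layers=28, num_gpus=2)
    print(device_map["layers.0"], device_map["layers.27"])
```

A custom mapping passed as `device_map` would have the same general shape (module name to device index), but with the real module names of the loaded model.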
--- a/Intel_device_demo/README.md
+++ b/Intel_device_demo/README.md
+# Intel Device Demo
+
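+This folder helps developers accelerate deployment of the ChatGLM3-6B model on Intel devices.
+
+## 1. Hardware Preparation
+The devices supported by this folder include:
+- Intel CPU series, including personal CPUs and server / workstation CPUs
+- Intel Arc discrete GPU series, including the Arc A770 and other cards.
+- Intel CPU integrated graphics series
+- Other Intel toolkits that, in theory, support OpenVINO acceleration.
+
+## 2. File Directory
+- IPEX_llm_xxx_demo: IPEX-LLM is a low-precision, lightweight large language model library built for Intel XPUs (Xeon/Core/Flex/Arc/PVC), with broad model support, the lowest latency, and the smallest memory footprint on Intel platforms. This demo is an accelerated model deployment example built on it.
+- OpenVINO_demo: An accelerated model deployment example using the Intel OpenVINO inference acceleration framework.
+- Pytorch_demo (not yet released): Development in a PyTorch environment using the Intel PyTorch Extension (for Intel Arc series GPUs)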
+
--- a/Intel_device_demo/ipex_llm_cpu_demo/api_server.py
+++ b/Intel_device_demo/ipex_llm_cpu_demo/api_server.py
+"""
+This script implements an API for the ChatGLM3-6B model,
+formatted similarly to OpenAI's API (https://platform.openai.com/docs/api-reference/chat).
+It's designed to be run as a web server using FastAPI and uvicorn,
+making the ChatGLM3-6B model accessible through OpenAI Client.
+
+Key Components and Features:
+- Model and Tokenizer Setup: Configures the model and tokenizer paths and loads them.
+- FastAPI Configuration: Sets up a FastAPI application with CORS middleware for handling cross-origin requests.
+- API Endpoints:
+  - "/v1/models": Lists the available models, specifically ChatGLM3-6B.
+  - "/v1/chat/completions": Processes chat completion requests with options for streaming and regular responses.
+  - "/v1/embeddings": Processes embedding requests for a list of text inputs.
+- Token Limit Caution: In the OpenAI API, 'max_tokens' is equivalent to HuggingFace's 'max_new_tokens', not 'max_length'.
+For instance, setting 'max_tokens' to 8192 for a 6b model would result in an error due to the model's inability to output
+that many tokens after accounting for the history and prompt tokens.
+- Stream Handling and Custom Functions: Manages streaming responses and custom function calls within chat responses.
+- Pydantic Models: Defines structured models for requests and responses, enhancing API documentation and type safety.
+- Main Execution: Initializes the model and tokenizer, and starts the FastAPI app on the designated host and port.
+
+Note:
+    This script doesn't include the setup for special tokens or multi-GPU support by default.
+    Users need to configure their special tokens and can enable multi-GPU support as per the provided instructions.
+    Embedding models are only supported on a single GPU.
+
+"""
+
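The token limit caution in the docstring above amounts to a budget check: the prompt tokens plus the newly generated tokens must fit inside the model's context window. A minimal sketch, using a hypothetical helper name and an assumed 8192-token window (this file does not state the window size):

```python
def clamp_max_new_tokens(prompt_tokens: int, requested: int, context_window: int = 8192) -> int:
    """Return the largest max_new_tokens value that still fits after the prompt.

    Hypothetical helper; context_window=8192 is an assumed default.
    """
    available = context_window - prompt_tokens
    if available <= 0:
        raise ValueError("prompt already fills the context window")
    return min(requested, available)


if __name__ == "__main__":
    # A 1000-token prompt leaves room for at most 7192 new tokens.
    print(clamp_max_new_tokens(prompt_tokens=1000, requested=8192))
```

Under these assumptions, a request for `max_tokens=8192` can only be served in full when the prompt is empty, which is why the docstring warns against it.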
+import os
+import time
+import tiktoken
+import torch
+import uvicorn
+
+from fastapi import FastAPI, HTTPException, Response
+from fastapi.middleware.cors import CORSMiddleware
+
+from contextlib import asynccontextmanager
+from typing import List, Literal, Optional, Union
+from loguru import logger
+from pydantic import BaseModel, Field
+from ipex_llm.transformers import AutoModel
+from transformers import AutoTokenizer
+from utils import process_response, generate_chatglm3, generate_stream_chatglm3
+# from sentence_transformers import SentenceTransformer
+
+from sse_starlette.sse import EventSourceResponse
+
+# Set the SSE keep-alive ping interval (in seconds)
+EventSourceResponse.DEFAULT_PING_INTERVAL = 1000
+
+# set LLM path
+MODEL_PATH = os.environ.get('MODEL_PATH', 'THUDM/chatglm3-6b')
+TOKENIZER_PATH = os.environ.get("TOKENIZER_PATH", MODEL_PATH)
+
+# set Embedding Model path
+EMBEDDING_PATH = os.environ.get('EMBEDDING_PATH', 'BAAI/bge-large-zh-v1.5')
+
+
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    yield
+    if torch.cuda.is_available():
+        torch.cuda.empty_cache()
+        torch.cuda.ipc_collect()
+
+
+app = FastAPI(lifespan=lifespan)
+
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["*"],
+    allow_credentials=True,
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
+
+
+class ModelCard(BaseModel):
+    id: str
+    object: str = "model"
+    created: int = Field(default_factory=lambda: int(time.time()))
+    owned_by: str = "owner"
+    root: Optional[str] = None
+    parent: Optional[str] = None
+    permission: Optional[list] = None
+
+
+class ModelList(BaseModel):
+    object: str = "list"
+    data: List[ModelCard] = []
+
+
+class FunctionCallResponse(BaseModel):
+    name: Optional[str] = None
+    arguments: Optional[str] = None
+
+
+class ChatMessage(BaseModel):
+    role: Literal["user", "assistant", "system", "function"]
+    content: Optional[str] = None
+    name: Optional[str] = None
+    function_call: Optional[FunctionCallResponse] = None
+
+
+class DeltaMessage(BaseModel):
+    role: Optional[Literal["user", "assistant", "system"]] = None
+    content: Optional[str] = None
+    function_call: Optional[FunctionCallResponse] = None
+
+
+## for Embedding
+class EmbeddingRequest(BaseModel):
+    input: List[str]
+    model: str
+
+
+class CompletionUsage(BaseModel):
+    prompt_tokens: int
+    completion_tokens: int
+    total_tokens: int
+
+
+class EmbeddingResponse(BaseModel):
+    data: list
+    model: str
+    object: str
+    usage: CompletionUsage
+
+
+# for ChatCompletionRequest
+
+class UsageInfo(BaseModel):
+    prompt_tokens: int = 0
+    total_tokens: int = 0
+    completion_tokens: Optional[int] = 0
+
+
+class ChatCompletionRequest(BaseModel):
+    model: str
+    messages: List[ChatMessage]
+    temperature: Optional[float] = 0.8
+    top_p: Optional[float] = 0.8
+    max_tokens: Optional[int] = None
+    stream: Optional[bool] = False
+    tools: Optional[Union[dict, List[dict]]] = None
+    repetition_penalty: Optional[float] = 1.1
+
+
+class ChatCompletionResponseChoice(BaseModel):
+    index: int
+    message: ChatMessage
+    finish_reason: Literal["stop", "length", "function_call"]
+
+
+class ChatCompletionResponseStreamChoice(BaseModel):
+    delta: DeltaMessage
+    finish_reason: Optional[Literal["stop", "length", "function_call"]]
+    index: int
+
+
+class ChatCompletionResponse(BaseModel):
+    model: str
+    id: str
+    object: Literal["chat.completion", "chat.completion.chunk"]
+    choices: List[Union[ChatCompletionResponseChoice, ChatCompletionResponseStreamChoice]]
+    created: Optional[int] = Field(default_factory=lambda: int(time.time()))
+    usage: Optional[UsageInfo] = None
+
+
+@app.get("/health")
+async def health() -> Response:
+    """Health check."""
+    return Response(status_code=200)
+
+
+@app.post("/v1/embeddings", response_model=EmbeddingResponse)
+async def get_embeddings(request: EmbeddingRequest):
+    embeddings = [embedding_model.encode(text) for text in request.input]
+    embeddings = [embedding.tolist() for embedding in embeddings]
+
+    def num_tokens_from_string(string: str) -> int:
+        """
+        Returns the number of tokens in a text string.
+        use cl100k_base tokenizer
+        """
+        encoding = tiktoken.get_encoding('cl100k_base')
+        num_tokens = len(encoding.encode(string))
+        return num_tokens
+
+    response = {
+        "data": [
+            {
+                "object": "embedding",
+                "embedding": embedding,
+                "index": index
+            }
+            for index, embedding in enumerate(embeddings)
+        ],
+        "model": request.model,
+        "object": "list",
+        "usage": CompletionUsage(
+            prompt_tokens=sum(num_tokens_from_string(text) for text in request.input),
+            completion_tokens=0,
+            total_tokens=sum(num_tokens_from_string(text) for text in request.input),
+        )
+    }
+    return response
+
+
+@app.get("/v1/models", response_model=ModelList)
+async def list_models():
+    model_card = ModelCard(
+        id="chatglm3-6b"
+    )
+    return ModelList(
+        data=[model_card]
+    )
+
+
+@app.post("/v1/chat/completions", response_model=ChatCompletionResponse)
+async def create_chat_completion(request: ChatCompletionRequest):
+    global model, tokenizer
+
+    if len(request.messages) < 1 or request.messages[-1].role == "assistant":
+        raise HTTPException(status_code=400, detail="Invalid request")
+
+    gen_params = dict(
+        messages=request.messages,
+        temperature=request.temperature,
+        top_p=request.top_p,
+        max_tokens=request.max_tokens or 1024,
+        echo=False,
+        stream=request.stream,
+        repetition_penalty=request.repetition_penalty,
+        tools=request.tools,
+    )
+    logger.debug(f"==== request ====\n{gen_params}")
+
+    if request.stream:
+
+        # Read the first few characters in stream mode; if it is not a function call, stream the output directly
+        predict_stream_generator = predict_stream(request.model, gen_params)
+        output = next(predict_stream_generator)
+        if not contains_custom_function(output):
+            return EventSourceResponse(predict_stream_generator, media_type="text/event-stream")
+
+        # The full output has been collected at this point; determine whether a tool needs to be called.
+        logger.debug(f"First result output：\n{output}")
+
+        function_call = None
+        if output and request.tools:
+            try:
+                function_call = process_response(output, use_tool=True)
+            except Exception:
+                logger.warning("Failed to parse tool call")
+
+        # CallFunction
+        if isinstance(function_call, dict):
+            function_call = FunctionCallResponse(**function_call)
+
+            """
+            In this demo, we did not register any tools.
+            You can use the tools that have been implemented in our `tools_using_demo` and implement your own streaming tool implementation here.
+            Similar to the following method:
+                function_args = json.loads(function_call.arguments)
+                tool_response = dispatch_tool(tool_name: str, tool_params: dict)
+            """
+            tool_response = ""
+
+            if not gen_params.get("messages"):
+                gen_params["messages"] = []
+
+            gen_params["messages"].append(ChatMessage(
+                role="assistant",
+                content=output,
+            ))
+            gen_params["messages"].append(ChatMessage(
+                role="function",
+                name=function_call.name,
+                content=tool_response,
+            ))
+
+            # Streaming output of results after function calls
+            generate = predict(request.model, gen_params)
+            return EventSourceResponse(generate, media_type="text/event-stream")
+
+        else:
+            # Fallback for when the tool call could not be parsed: stream the raw output as plain text.
+            generate = parse_output_text(request.model, output)
+            return EventSourceResponse(generate, media_type="text/event-stream")
+
+    # Non-streaming path (stream=False)
+    response = generate_chatglm3(model, tokenizer, gen_params)
+
+    # Remove the first newline character
+    if response["text"].startswith("\n"):
+        response["text"] = response["text"][1:]
+    response["text"] = response["text"].strip()
+
+    usage = UsageInfo()
+    function_call, finish_reason = None, "stop"
+    if request.tools:
+        try:
+            function_call = process_response(response["text"], use_tool=True)
+        except Exception:
+            logger.warning("Failed to parse tool call; the response may not be a tool call or may already have been answered.")
+
+    if isinstance(function_call, dict):
+        finish_reason = "function_call"
+        function_call = FunctionCallResponse(**function_call)
+
+    message = ChatMessage(
+        role="assistant",
+        content=response["text"],
+        function_call=function_call if isinstance(function_call, FunctionCallResponse) else None,
+    )
+
+    logger.debug(f"==== message ====\n{message}")
+
+    choice_data = ChatCompletionResponseChoice(
+        index=0,
+        message=message,
+        finish_reason=finish_reason,
+    )
+    task_usage = UsageInfo.model_validate(response["usage"])
+    for usage_key, usage_value in task_usage.model_dump().items():
+        setattr(usage, usage_key, getattr(usage, usage_key) + usage_value)
+
+    return ChatCompletionResponse(
+        model=request.model,
+        id="",  # for open_source model, id is empty
+        choices=[choice_data],
+        object="chat.completion",
+        usage=usage
+    )
+
+
+async def predict(model_id: str, params: dict):
+    global model, tokenizer
+
+    choice_data = ChatCompletionResponseStreamChoice(
+        index=0,
+        delta=DeltaMessage(role="assistant"),
+        finish_reason=None
+    )
+    chunk = ChatCompletionResponse(model=model_id, id="", choices=[choice_data], object="chat.completion.chunk")
+    yield "{}".format(chunk.model_dump_json(exclude_unset=True))
+
+    previous_text = ""
+    for new_response in generate_stream_chatglm3(model, tokenizer, params):
+        decoded_unicode = new_response["text"]
+        delta_text = decoded_unicode[len(previous_text):]
+        previous_text = decoded_unicode
+
+        finish_reason = new_response["finish_reason"]
+        if len(delta_text) == 0 and finish_reason != "function_call":
+            continue
+
+        function_call = None
+        if finish_reason == "function_call":
+            try:
+                function_call = process_response(decoded_unicode, use_tool=True)
+            except Exception:
+                logger.warning(
+                    "Failed to parse tool call; the response may not be a tool call or may already have been answered.")
+
+        if isinstance(function_call, dict):
+            function_call = FunctionCallResponse(**function_call)
+
+        delta = DeltaMessage(
+            content=delta_text,
+            role="assistant",
+            function_call=function_call if isinstance(function_call, FunctionCallResponse) else None,
+        )
+
+        choice_data = ChatCompletionResponseStreamChoice(
+            index=0,
+            delta=delta,
+            finish_reason=finish_reason
+        )
+        chunk = ChatCompletionResponse(
+            model=model_id,
+            id="",
+            choices=[choice_data],
+            object="chat.completion.chunk"
+        )
+        yield "{}".format(chunk.model_dump_json(exclude_unset=True))
+
+    choice_data = ChatCompletionResponseStreamChoice(
+        index=0,
+        delta=DeltaMessage(),
+        finish_reason="stop"
+    )
+    chunk = ChatCompletionResponse(
+        model=model_id,
+        id="",
+        choices=[choice_data],
+        object="chat.completion.chunk"
+    )
+    yield "{}".format(chunk.model_dump_json(exclude_unset=True))
+    yield '[DONE]'
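`predict` relies on each step of the generator returning the full decoded text so far, and slices off the part that was already sent. The same idea in isolation, as a sketch in which a made-up list of cumulative snapshots stands in for `generate_stream_chatglm3` (the function name `iter_deltas` is hypothetical):

```python
from typing import Iterable, Iterator


def iter_deltas(snapshots: Iterable[str]) -> Iterator[str]:
    """Yield only the newly appended text from cumulative snapshots."""
    previous = ""
    for text in snapshots:
        delta = text[len(previous):]
        previous = text
        if delta:  # skip steps that produced no new characters
            yield delta


if __name__ == "__main__":
    print(list(iter_deltas(["He", "Hello", "Hello", "Hello, world"])))
```

Concatenating the yielded deltas reproduces the final snapshot, which is what a streaming client does with the `delta.content` fields.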
+
+
+def predict_stream(model_id, gen_params):
+    """
+    Make function calls compatible with streaming output.
+
+    Once more than seven characters have been generated, the output is checked for a function-call prefix.
+    If it is not a function call, the text is streamed out directly.
+    Otherwise, the complete text of the function call is returned at the end.
+
+    :param model_id:
+    :param gen_params:
+    :return:
+    """
+    output = ""
+    is_function_call = False
+    has_send_first_chunk = False
+    for new_response in generate_stream_chatglm3(model, tokenizer, gen_params):
+        decoded_unicode = new_response["text"]
+        delta_text = decoded_unicode[len(output):]
+        output = decoded_unicode
+
+        # While no function call has been detected and the output is longer than 7 characters,
+        # check for the special function prefix to decide whether this is a function call
+        if not is_function_call and len(output) > 7:
+
+            # Determine whether a function is called
+            is_function_call = contains_custom_function(output)
+            if is_function_call:
+                continue
+
+            # Non-function call, direct stream output
+            finish_reason = new_response["finish_reason"]
+
+            # Send an empty string first to avoid truncation by subsequent next() operations.
+            if not has_send_first_chunk:
+                message = DeltaMessage(
+                    content="",
+                    role="assistant",
+                    function_call=None,
+                )
+                choice_data = ChatCompletionResponseStreamChoice(
+                    index=0,
+                    delta=message,
+                    finish_reason=finish_reason
+                )
+                chunk = ChatCompletionResponse(
+                    model=model_id,
+                    id="",
+                    choices=[choice_data],
+                    created=int(time.time()),
+                    object="chat.completion.chunk"
+                )
+                yield "{}".format(chunk.model_dump_json(exclude_unset=True))
+
+            send_msg = delta_text if has_send_first_chunk else output
+            has_send_first_chunk = True
+            message = DeltaMessage(
+                content=send_msg,
+                role="assistant",
+                function_call=None,
+            )
+            choice_data = ChatCompletionResponseStreamChoice(
+                index=0,
+                delta=message,
+                finish_reason=finish_reason
+            )
+            chunk = ChatCompletionResponse(
+                model=model_id,
+                id="",
+                choices=[choice_data],
+                created=int(time.time()),
+                object="chat.completion.chunk"
+            )
+            yield "{}".format(chunk.model_dump_json(exclude_unset=True))
+
+    if is_function_call:
+        yield output
+    else:
+        yield '[DONE]'
+
+
+async def parse_output_text(model_id: str, value: str):
+    """
+    Directly output the text content of value
+
+    :param model_id:
+    :param value:
+    :return:
+    """
+    choice_data = ChatCompletionResponseStreamChoice(
+        index=0,
+        delta=DeltaMessage(role="assistant", content=value),
+        finish_reason=None
+    )
+    chunk = ChatCompletionResponse(model=model_id, id="", choices=[choice_data], object="chat.completion.chunk")
+    yield "{}".format(chunk.model_dump_json(exclude_unset=True))
+
+    choice_data = ChatCompletionResponseStreamChoice(
+        index=0,
+        delta=DeltaMessage(),
+        finish_reason="stop"
+    )
+    chunk = ChatCompletionResponse(model=model_id, id="", choices=[choice_data], object="chat.completion.chunk")
+    yield "{}".format(chunk.model_dump_json(exclude_unset=True))
+    yield '[DONE]'
+
+
+def contains_custom_function(value: str) -> bool:
+    """
+    Determine whether the output is a function call, based on a special function prefix.
+
+    For example, the functions defined in "tools_using_demo/tool_register.py" are all named "get_xxx" and start with "get_".
+
+    [Note] This is not a rigorous check; it is for reference only.
+
+    :param value:
+    :return:
+    """
+    return bool(value) and 'get_' in value
+
+if __name__ == "__main__":
+    # Load LLM
+    tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH, trust_remote_code=True)
+    model = AutoModel.from_pretrained(MODEL_PATH,
+                                      load_in_4bit=True,
+                                      trust_remote_code=True)
+    # load Embedding
+    # embedding_model = SentenceTransformer(EMBEDDING_PATH, device="cuda")
+    uvicorn.run(app, host='0.0.0.0', port=8000, workers=1)
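Review note: `contains_custom_function` matches `'get_'` anywhere in the output, so ordinary text that merely mentions such a name would be treated as a function call. Below is a minimal sketch of a stricter check. It assumes the tool name occupies the first line of the output (which is how `process_response` in `utils.py` splits off `metadata`) and that the set of registered tool names is available. The names `looks_like_tool_call` and `registered_tools` are hypothetical and not part of this change.

```python
from typing import Iterable


def looks_like_tool_call(output: str, registered_tools: Iterable[str]) -> bool:
    """Hypothetical stricter replacement for contains_custom_function.

    Returns True when the first line of `output` is a registered tool name,
    or, while the first line is still being streamed, when it is a prefix
    of at least one registered tool name.
    """
    if not output:
        return False
    first_line, newline, _ = output.partition("\n")
    candidate = first_line.strip()
    if not candidate:
        return False
    tools = set(registered_tools)
    if newline:
        # The first line is complete, so require an exact match.
        return candidate in tools
    # The first line may still be growing, so accept a matching prefix.
    return any(name.startswith(candidate) for name in tools)


if __name__ == "__main__":
    tools = ["get_current_weather"]
    print(looks_like_tool_call("get_current_weather\n```python", tools))
    print(looks_like_tool_call("You can call get_current_weather", tools))
```

Because a partial first line is only a tentative match, a caller in streaming mode would need to re-check once the newline arrives. That trade-off is not handled in this sketch.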
--- a/Intel_device_demo/ipex_llm_cpu_demo/chatglm3_infer.py
+++ b/Intel_device_demo/ipex_llm_cpu_demo/chatglm3_infer.py
+import time
+from ipex_llm.transformers import AutoModel
+from transformers import AutoTokenizer
+
+CHATGLM_V3_PROMPT_FORMAT = "\n{prompt}\n"
+
+# Please specify the local path to the chatglm3-6b model
+
+model_path = r"D:\AI\ChatGLM3\model/chatglm3-6b/"
+
+# Load the ChatGLM3-6B model and quantize it to INT4
+model = AutoModel.from_pretrained(model_path,
+                                  load_in_4bit=True,
+                                  trust_remote_code=True)
+# Load the tokenizer
+tokenizer = AutoTokenizer.from_pretrained(model_path,
+                                          trust_remote_code=True)
+# Prepare ChatGLM3 format prompt
+prompt = CHATGLM_V3_PROMPT_FORMAT.format(prompt="Who are you?")
+# Encode the prompt
+input_ids = tokenizer.encode(prompt, return_tensors="pt")
+st = time.time()
+# Perform inference calculation and generate Tokens
+output = model.generate(input_ids, max_new_tokens=32)
+end = time.time()
+# Decode the generated Tokens and display them
+output_str = tokenizer.decode(output[0], skip_special_tokens=True)
+print(f'Inference time: {end-st} s')
+print('-'*20, 'Prompt', '-'*20)
+print(prompt)
+print('-'*20, 'Output', '-'*20)
+print(output_str)
--- a/Intel_device_demo/ipex_llm_cpu_demo/chatglm3_web_demo.py
+++ b/Intel_device_demo/ipex_llm_cpu_demo/chatglm3_web_demo.py
+"""
+This script creates an interactive web demo for the ChatGLM3-6B model using Streamlit,
+a Python library for building web apps for machine learning models.
+It showcases the capabilities of the ChatGLM3-6B model, loaded in 4-bit with IPEX-LLM,
+through a chat-like interface.
+
+Usage:
+- Run the script with `streamlit run chatglm3_web_demo.py` to start the web server.
+- Interact with the model by typing questions and receiving responses.
+
+Requirements:
+- Streamlit should be installed, in a version that provides `st.chat_message`, `st.chat_input` and `st.rerun`.
+
+"""
+
+import os
+import streamlit as st
+from ipex_llm.transformers import AutoModel
+from transformers import AutoTokenizer
+
+
+st.set_page_config(
+    page_title="ChatGLM3-6B+BigDL-LLM demo",
+    page_icon=":robot:",
+    layout="wide"
+)
+
+MODEL_PATH = os.environ.get('MODEL_PATH', 'THUDM/chatglm3-6b')
+
+@st.cache_resource
+def get_model():
+    model = AutoModel.from_pretrained(MODEL_PATH,
+                                    load_in_4bit=True,
+                                    trust_remote_code=True)
+    
+    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH,
+                                            trust_remote_code=True)
+    return tokenizer, model
+
+tokenizer, model = get_model()
+
+if "history" not in st.session_state:
+    st.session_state.history = []
+if "past_key_values" not in st.session_state:
+    st.session_state.past_key_values = None
+
+max_length = st.sidebar.slider("max_length", 0, 32768, 8192, step=1)
+top_p = st.sidebar.slider("top_p", 0.0, 1.0, 0.8, step=0.01)
+temperature = st.sidebar.slider("temperature", 0.0, 1.0, 0.6, step=0.01)
+
+buttonClean = st.sidebar.button("clearing session history", key="clean")
+if buttonClean:
+    st.session_state.history = []
+    st.session_state.past_key_values = None
+    st.rerun()
+
+for i, message in enumerate(st.session_state.history):
+    if message["role"] == "user":
+        with st.chat_message(name="user", avatar="user"):
+            st.markdown(message["content"])
+    else:
+        with st.chat_message(name="assistant", avatar="assistant"):
+            st.markdown(message["content"])
+
+with st.chat_message(name="user", avatar="user"):
+    input_placeholder = st.empty()
+with st.chat_message(name="assistant", avatar="assistant"):
+    message_placeholder = st.empty()
+
+prompt_text = st.chat_input("please enter your question.")
+
+if prompt_text:
+
+    input_placeholder.markdown(prompt_text)
+    history = st.session_state.history
+    past_key_values = st.session_state.past_key_values
+    for response, history, past_key_values in model.stream_chat(
+        tokenizer,
+        prompt_text,
+        history,
+        past_key_values=past_key_values,
+        max_length=max_length,
+        top_p=top_p,
+        temperature=temperature,
+        return_past_key_values=True,
+    ):
+        message_placeholder.markdown(response)
+
+    st.session_state.history = history
+    st.session_state.past_key_values = past_key_values
\ No newline at end of file
--- a/Intel_device_demo/ipex_llm_cpu_demo/generate.py
+++ b/Intel_device_demo/ipex_llm_cpu_demo/generate.py
+import torch
+import time
+import argparse
+import numpy as np
+
+from ipex_llm.transformers import AutoModel
+from modelscope import AutoTokenizer
+
+# you could tune the prompt based on your own model,
+# here the prompt tuning refers to https://github.com/THUDM/ChatGLM3/blob/main/PROMPT.md
+CHATGLM_V3_PROMPT_FORMAT = "<|user|>\n{prompt}\n<|assistant|>"
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for ModelScope ChatGLM3 model')
+    parser.add_argument('--repo-id-or-model-path', type=str, default="ZhipuAI/chatglm3-6b",
+                        help='The ModelScope repo id for the ChatGLM3 model to be downloaded'
+                             ', or the path to the ModelScope checkpoint folder')
+    parser.add_argument('--prompt', type=str, default="AI是什么？",
+                        help='Prompt to infer')
+    parser.add_argument('--n-predict', type=int, default=32,
+                        help='Max tokens to predict')
+
+    args = parser.parse_args()
+    model_path = args.repo_id_or_model_path
+
+    # Load the model in 4-bit,
+    # which converts the relevant layers in the model to INT4 format.
+    # It is important to set `model_hub='modelscope'`; otherwise the model hub defaults to Hugging Face
+    model = AutoModel.from_pretrained(model_path,
+                                      load_in_4bit=True,
+                                      trust_remote_code=True,
+                                      model_hub='modelscope')
+
+    # Load tokenizer
+    tokenizer = AutoTokenizer.from_pretrained(model_path,
+                                              trust_remote_code=True)
+
+    # Generate predicted tokens
+    with torch.inference_mode():
+        prompt = CHATGLM_V3_PROMPT_FORMAT.format(prompt=args.prompt)
+        input_ids = tokenizer.encode(prompt, return_tensors="pt")
+        st = time.time()
+        # if your selected model is capable of utilizing previous key/value attentions
+        # to enhance decoding speed, but has `"use_cache": false` in its model config,
+        # it is important to set `use_cache=True` explicitly in the `generate` function
+        # to obtain optimal performance with IPEX-LLM INT4 optimizations
+        output = model.generate(input_ids,
+                                max_new_tokens=args.n_predict)
+        end = time.time()
+        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
+        print(f'Inference time: {end - st} s')
+        print('-' * 20, 'Prompt', '-' * 20)
+        print(prompt)
+        print('-' * 20, 'Output', '-' * 20)
+        print(output_str)
\ No newline at end of file
--- a/Intel_device_demo/ipex_llm_cpu_demo/openai_api_request.py
+++ b/Intel_device_demo/ipex_llm_cpu_demo/openai_api_request.py
+"""
+This script is an example of using the OpenAI API to interact with a ChatGLM3 model.
+It includes functions to:
+
+1. Conduct a function-calling chat session, asking about weather conditions in multiple cities.
+2. Initiate a simple chat in Chinese, asking the model to tell a short story.
+
+Each function demonstrates a different aspect of the API's capabilities, showcasing how to make requests
+and handle responses.
+"""
+
+from openai import OpenAI
+import time
+
+base_url = "http://127.0.0.1:8000/v1/"
+client = OpenAI(api_key="EMPTY", base_url=base_url)
+
+
+def function_chat():
+    messages = [{"role": "user", "content": "What's the weather like in San Francisco, Tokyo, and Paris?"}]
+    tools = [
+        {
+            "type": "function",
+            "function": {
+                "name": "get_current_weather",
+                "description": "Get the current weather in a given location",
+                "parameters": {
+                    "type": "object",
+                    "properties": {
+                        "location": {
+                            "type": "string",
+                            "description": "The city and state, e.g. San Francisco, CA",
+                        },
+                        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
+                    },
+                    "required": ["location"],
+                },
+            },
+        }
+    ]
+
+    response = client.chat.completions.create(
+        model="chatglm3-6b",
+        messages=messages,
+        tools=tools,
+        tool_choice="auto",
+    )
+    if response:
+        content = response.choices[0].message.content
+        print(content)
+    else:
+        print("Error:", response.status_code)
+
+
+def simple_chat(use_stream=True):
+    messages = [
+        {
+            "role": "system",
+            "content": "You are ChatGLM3, a large language model trained by Zhipu.AI. Follow the user's "
+                       "instructions carefully. Respond using markdown.",
+        },
+        {
+            "role": "user",
+            "content": "你好，请你用生动的话语给我讲一个小故事吧"
+        }
+    ]
+    response = client.chat.completions.create(
+        model="chatglm3-6b",
+        messages=messages,
+        stream=use_stream,
+        max_tokens=256,
+        temperature=0.8,
+        presence_penalty=1.1,
+        top_p=0.8)
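+    # The OpenAI client raises an exception on HTTP errors, so no status check is needed here
+    if use_stream:
+        for chunk in response:
+            # delta.content can be None (e.g. in the final chunk), so fall back to ""
+            print(chunk.choices[0].delta.content or "", end="", flush=True)
+        print()
+    else:
+        content = response.choices[0].message.content
+        print(content)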
+
+
+if __name__ == "__main__":
+    simple_chat(use_stream=False)
+    simple_chat(use_stream=True)
--- a/Intel_device_demo/ipex_llm_cpu_demo/utils.py
+++ b/Intel_device_demo/ipex_llm_cpu_demo/utils.py
+import gc
+import json
+import torch
+from transformers import PreTrainedModel, PreTrainedTokenizer
+from transformers.generation.logits_process import LogitsProcessor
+from typing import Union, Tuple
+
+
+class InvalidScoreLogitsProcessor(LogitsProcessor):
+    def __call__(
+            self, input_ids: torch.LongTensor, scores: torch.FloatTensor
+    ) -> torch.FloatTensor:
+        if torch.isnan(scores).any() or torch.isinf(scores).any():
+            scores.zero_()
+            scores[..., 5] = 5e4
+        return scores
+
+
+def process_response(output: str, use_tool: bool = False) -> Union[str, dict]:
+    content = ""
+    for response in output.split("<|assistant|>"):
+        metadata, content = response.split("\n", maxsplit=1)
+        if not metadata.strip():
+            content = content.strip()
+            content = content.replace("[[训练时间]]", "2023年")
+        else:
+            if use_tool:
+                content = "\n".join(content.split("\n")[1:-1])
+
+                def tool_call(**kwargs):
+                    return kwargs
+
+                # NOTE: eval() executes model-generated text; only use it with trusted output
+                parameters = eval(content)
+                content = {
+                    "name": metadata.strip(),
+                    "arguments": json.dumps(parameters, ensure_ascii=False)
+                }
+            else:
+                content = {
+                    "name": metadata.strip(),
+                    "content": content
+                }
+    return content
+
+
+@torch.inference_mode()
+def generate_stream_chatglm3(model: PreTrainedModel, tokenizer: PreTrainedTokenizer, params: dict):
+    messages = params["messages"]
+    tools = params["tools"]
+    temperature = float(params.get("temperature", 1.0))
+    repetition_penalty = float(params.get("repetition_penalty", 1.0))
+    top_p = float(params.get("top_p", 1.0))
+    max_new_tokens = int(params.get("max_tokens", 256))
+    echo = params.get("echo", True)
+    messages = process_chatglm_messages(messages, tools=tools)
+    query, role = messages[-1]["content"], messages[-1]["role"]
+
+    inputs = tokenizer.build_chat_input(query, history=messages[:-1], role=role)
+    inputs = inputs.to(model.device)
+    input_echo_len = len(inputs["input_ids"][0])
+
+    if input_echo_len >= model.config.seq_length:
+        print(f"Input length larger than {model.config.seq_length}")
+
+    eos_token_id = [
+        tokenizer.eos_token_id,
+        tokenizer.get_command("<|user|>"),
+    ]
+
+    gen_kwargs = {
+        "max_new_tokens": max_new_tokens,
+        "do_sample": True if temperature > 1e-5 else False,
+        "top_p": top_p,
+        "repetition_penalty": repetition_penalty,
+        "logits_processor": [InvalidScoreLogitsProcessor()],
+    }
+    if temperature > 1e-5:
+        gen_kwargs["temperature"] = temperature
+
+    total_len = 0
+    for total_ids in model.stream_generate(**inputs, eos_token_id=eos_token_id, **gen_kwargs):
+        total_ids = total_ids.tolist()[0]
+        total_len = len(total_ids)
+        if echo:
+            output_ids = total_ids[:-1]
+        else:
+            output_ids = total_ids[input_echo_len:-1]
+
+        response = tokenizer.decode(output_ids)
+        if response and response[-1] != "�":
+            response, stop_found = apply_stopping_strings(response, ["<|observation|>"])
+
+            yield {
+                "text": response,
+                "usage": {
+                    "prompt_tokens": input_echo_len,
+                    "completion_tokens": total_len - input_echo_len,
+                    "total_tokens": total_len,
+                },
+                "finish_reason": "function_call" if stop_found else None,
+            }
+
+            if stop_found:
+                break
+
+    # Only last stream result contains finish_reason, we set finish_reason as stop
+    ret = {
+        "text": response,
+        "usage": {
+            "prompt_tokens": input_echo_len,
+            "completion_tokens": total_len - input_echo_len,
+            "total_tokens": total_len,
+        },
+        "finish_reason": "stop",
+    }
+    yield ret
+
+    gc.collect()
+    torch.cuda.empty_cache()
+
+
+def process_chatglm_messages(messages, tools=None):
+    _messages = messages
+    messages = []
+    if tools:
+        messages.append(
+            {
+                "role": "system",
+                "content": "Answer the following questions as best as you can. You have access to the following tools:",
+                "tools": tools
+            }
+        )
+
+    for m in _messages:
+        role, content, func_call = m.role, m.content, m.function_call
+        if role == "function":
+            messages.append(
+                {
+                    "role": "observation",
+                    "content": content
+                }
+            )
+
+        elif role == "assistant" and func_call is not None:
+            for response in content.split("<|assistant|>"):
+                metadata, sub_content = response.split("\n", maxsplit=1)
+                messages.append(
+                    {
+                        "role": role,
+                        "metadata": metadata,
+                        "content": sub_content.strip()
+                    }
+                )
+        else:
+            messages.append({"role": role, "content": content})
+    return messages
+
+
+def generate_chatglm3(model: PreTrainedModel, tokenizer: PreTrainedTokenizer, params: dict):
+    for response in generate_stream_chatglm3(model, tokenizer, params):
+        pass
+    return response
+
+
+def apply_stopping_strings(reply, stop_strings) -> Tuple[str, bool]:
+    stop_found = False
+    for string in stop_strings:
+        idx = reply.find(string)
+        if idx != -1:
+            reply = reply[:idx]
+            stop_found = True
+            break
+
+    if not stop_found:
+        # If something like "\nYo" is generated just before "\nYou:" is completed, trim it
+        for string in stop_strings:
+            for j in range(len(string) - 1, 0, -1):
+                if reply[-j:] == string[:j]:
+                    reply = reply[:-j]
+                    break
+            else:
+                continue
+
+            break
+
+    return reply, stop_found
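Review note: `process_response` passes model-generated text to `eval()` so that the locally defined `tool_call(**kwargs)` returns its keyword arguments. Below is a minimal sketch of extracting the same keyword arguments without executing anything, using the standard-library `ast` module. It assumes the text is a single `tool_call(...)` expression whose arguments are plain literals. The function name `parse_tool_call` is hypothetical and not part of this change.

```python
import ast
import json


def parse_tool_call(call_source: str) -> dict:
    """Hypothetical helper: read keyword arguments from 'tool_call(...)' text.

    Only literal values (strings, numbers, lists, dicts, ...) are accepted;
    anything else raises ValueError instead of being executed.
    """
    tree = ast.parse(call_source.strip(), mode="eval")
    call = tree.body
    if not (isinstance(call, ast.Call)
            and isinstance(call.func, ast.Name)
            and call.func.id == "tool_call"):
        raise ValueError("expected a single tool_call(...) expression")
    if call.args:
        raise ValueError("positional arguments are not supported")
    params = {}
    for kw in call.keywords:
        if kw.arg is None:
            raise ValueError("**kwargs unpacking is not supported")
        # literal_eval accepts an AST node and rejects non-literal expressions.
        params[kw.arg] = ast.literal_eval(kw.value)
    return params


if __name__ == "__main__":
    args = parse_tool_call('tool_call(location="Beijing", unit="celsius")')
    print(json.dumps(args, ensure_ascii=False))
```

The result can be passed to `json.dumps(..., ensure_ascii=False)` exactly as `process_response` does with the `eval()` result.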
--- a/Intel_device_demo/openvino_demo/README.md
+++ b/Intel_device_demo/openvino_demo/README.md
+
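+# Deploying the ChatGLM3-6B Model with OpenVINO
+
+[OpenVINO](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html) is an open-source toolkit from Intel designed for deep learning inference. It helps developers optimize models, improve inference performance, and reduce model memory usage. This example shows how to deploy ChatGLM3 with OpenVINO.
+
+You need to clone this repository, then follow the steps below to convert the model to an OpenVINO IR model and run inference.
+
+## 1. Environment Setup
+
+First, clone the OpenVINO GLM3 inference repository and install the dependencies.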
+
+```bash
+git clone https://github.com/OpenVINO-dev-contest/chatglm3.openvino.git
+cd chatglm3.openvino
+```
+
+Next, we recommend creating a new virtual environment and installing the dependencies as follows.
+
+```
+python3 -m venv openvino_env
+
+source openvino_env/bin/activate
+
+python3 -m pip install --upgrade pip
+
+pip install wheel setuptools
+
+pip install -r requirements.txt
+```
+
+## 2. Convert the Model
+
+Because the Hugging Face model must be converted to an OpenVINO IR model, you need to download the model and convert it.
+
+```
+python3 convert.py --model_id THUDM/chatglm3-6b --output {your_path}/chatglm3-6b 
+```
+
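+### Available Parameters
+
+* `--model_id` - Path (absolute path) to the directory containing the model.
+* `--output` - Path where the converted model is saved.
+
+## 3. Quantize the Model (Optional)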
+
+```
+python3 quantize.py --model_path {your_path}/chatglm3-6b --precision int4 --output {your_path}/chatglm3-6b-int4
+```
+
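+### Available Parameters
+
+* `--model_path` - Path to the directory containing the OpenVINO IR model.
+* `--precision` - Quantization precision: int8 or int4.
+* `--output` - Path where the quantized model is saved.
+
+## 4. Run the ChatGLM3 Model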
+
+```
+python3 chat.py --model_path {your_path}/chatglm3-6b --max_sequence_length 4096 --device CPU
+```
+
+### Available Parameters
+
+* `--model_path` - Path to the directory containing the OpenVINO IR model.
+* `--max_sequence_length` - Maximum number of output tokens.
+* `--device` - Device to run inference on.
+
+## Example
+
+```
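+User: Hello
+ChatGLM3-6B-OpenVINO: Hello! Is there anything I can help you with?
+
+User: Who are you?
+ChatGLM3-6B-OpenVINO: I am an AI assistant named ChatGLM3-6B, developed from a language model jointly trained in 2023 by the KEG Lab of Tsinghua University and Zhipu AI. My task is to provide appropriate answers and support for users' questions and requests.
+
+User: Please tell me an interesting story
+ChatGLM3-6B-OpenVINO: Once upon a time, there was a little boy named Xiao Ming who loved animals very much. One day, while walking in the forest, he found a very beautiful little bird. The bird was injured and could not fly. Xiao Ming felt very sorry for it and decided to take care of it. He took the bird home, built it a small nest, and found some soft grass and food. Every day he fed the bird and changed its water. Gradually the bird's injuries healed, and it began flying around Xiao Ming's home; they became very good friends. One day, however, Xiao Ming's parents told him that they had to take the bird he was raising back to the forest. Xiao Ming was very sad, because he and the bird had already become good friends. But his parents told him that the bird would be freer in the forest, and that he could still watch its life there. So Xiao Ming and his parents took the bird to the forest together. The bird was very happy, because it could fly again and had many other small animal friends. Xiao Ming was happy too, because he knew that even though he could not keep the bird forever, he could still admire its beauty in nature. From then on, Xiao Ming often came to the forest to look for the bird.
+
+User: Please give this story a title
+ChatGLM3-6B-OpenVINO: "The Power of Friendship: Xiao Ming and the Little Bird's Forest Adventure"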
+```
+
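+## FAQ
+1. Why do I still get a Hugging Face connection error when loading a local model?
+   - Downgrade the transformers library to version 4.37.2.
+
+2. Do I need to install the OpenVINO C++ inference engine?
+   - No.
+
+3. Do I have to use Intel hardware?
+   - We have only tested on Intel devices. We recommend x86-based Intel devices, including but not limited to:
+   - Intel CPUs, including PC and server CPUs.
+   - Intel discrete GPUs, for example the Arc A770.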
\ No newline at end of file
--- a/Intel_device_demo/openvino_demo/openvino_cli_demo.py
+++ b/Intel_device_demo/openvino_demo/openvino_cli_demo.py
+import argparse
+from typing import List, Tuple
+from threading import Thread
+import torch
+from optimum.intel.openvino import OVModelForCausalLM
+from transformers import (AutoTokenizer, AutoConfig,
+                          TextIteratorStreamer, StoppingCriteriaList, StoppingCriteria)
+
+
+def parse_text(text):
+    lines = text.split("\n")
+    lines = [line for line in lines if line != ""]
+    count = 0
+    for i, line in enumerate(lines):
+        if "```" in line:
+            count += 1
+            items = line.split('`')
+            if count % 2 == 1:
+                lines[i] = f'<pre><code class="language-{items[-1]}">'
+            else:
+                lines[i] = '<br></code></pre>'
+        else:
+            if i > 0:
+                if count % 2 == 1:
+                    line = line.replace("`", "\\`")
+                    line = line.replace("<", "&lt;")
+                    line = line.replace(">", "&gt;")
+                    line = line.replace(" ", "&nbsp;")
+                    line = line.replace("*", "&ast;")
+                    line = line.replace("_", "&lowbar;")
+                    line = line.replace("-", "&#45;")
+                    line = line.replace(".", "&#46;")
+                    line = line.replace("!", "&#33;")
+                    line = line.replace("(", "&#40;")
+                    line = line.replace(")", "&#41;")
+                    line = line.replace("$", "&#36;")
+                lines[i] = "<br>" + line
+    text = "".join(lines)
+    return text
+
+
+class StopOnTokens(StoppingCriteria):
+    def __init__(self, token_ids):
+        self.token_ids = token_ids
+
+    def __call__(
+            self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs
+    ) -> bool:
+        for stop_id in self.token_ids:
+            if input_ids[0][-1] == stop_id:
+                return True
+        return False
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(add_help=False)
+    parser.add_argument('-h',
+                        '--help',
+                        action='help',
+                        help='Show this help message and exit.')
+    parser.add_argument('-m',
+                        '--model_path',
+                        required=True,
+                        type=str,
+                        help='Required. model path')
+    parser.add_argument('-l',
+                        '--max_sequence_length',
+                        default=256,
+                        required=False,
+                        type=int,
+                        help='Optional. Maximum length of output')
+    parser.add_argument('-d',
+                        '--device',
+                        default='CPU',
+                        required=False,
+                        type=str,
+                        help='Optional. Device for inference')
+    args = parser.parse_args()
+    model_dir = args.model_path
+
+    ov_config = {"PERFORMANCE_HINT": "LATENCY",
+                 "NUM_STREAMS": "1", "CACHE_DIR": ""}
+
+    tokenizer = AutoTokenizer.from_pretrained(
+        model_dir, trust_remote_code=True)
+
+    print("====Compiling model====")
+    ov_model = OVModelForCausalLM.from_pretrained(
+        model_dir,
+        device=args.device,
+        ov_config=ov_config,
+        config=AutoConfig.from_pretrained(model_dir, trust_remote_code=True),
+        trust_remote_code=True,
+    )
+
+    streamer = TextIteratorStreamer(
+        tokenizer, timeout=60.0, skip_prompt=True, skip_special_tokens=True
+    )
+    stop_tokens = [0, 2]
+    stop_tokens = [StopOnTokens(stop_tokens)]
+
+    def convert_history_to_token(history: List[Tuple[str, str]]):
+
+        messages = []
+        for idx, (user_msg, model_msg) in enumerate(history):
+            if idx == len(history) - 1 and not model_msg:
+                messages.append({"role": "user", "content": user_msg})
+                break
+            if user_msg:
+                messages.append({"role": "user", "content": user_msg})
+            if model_msg:
+                messages.append({"role": "assistant", "content": model_msg})
+
+        model_inputs = tokenizer.apply_chat_template(messages,
+                                                     add_generation_prompt=True,
+                                                     tokenize=True,
+                                                     return_tensors="pt")
+        return model_inputs
+
+    history = []
+    print("====Starting conversation====")
+    while True:
+        input_text = input("用户: ")
+        if input_text.lower() == 'stop':
+            break
+
+        if input_text.lower() == 'clear':
+            history = []
+            print("AI助手: 对话历史已清空")
+            continue
+
+        print("ChatGLM3-6B-OpenVINO:", end=" ")
+        history = history + [[parse_text(input_text), ""]]
+        model_inputs = convert_history_to_token(history)
+        generate_kwargs = dict(
+            input_ids=model_inputs,
+            max_new_tokens=args.max_sequence_length,
+            temperature=0.1,
+            do_sample=True,
+            top_p=1.0,
+            top_k=50,
+            repetition_penalty=1.1,
+            streamer=streamer,
+            stopping_criteria=StoppingCriteriaList(stop_tokens)
+        )
+
+        t1 = Thread(target=ov_model.generate, kwargs=generate_kwargs)
+        t1.start()
+
+        partial_text = ""
+        for new_text in streamer:
+            print(new_text, end="", flush=True)
+            partial_text += new_text
+        print("\n")
+        history[-1][1] = partial_text
\ No newline at end of file
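Review note: the message-building half of `convert_history_to_token` can be exercised without a tokenizer. Below is a minimal sketch of that step as a standalone function, assuming the same `[user_msg, model_msg]` history layout used in the CLI loop. The name `history_to_messages` is hypothetical. Unlike the original, this sketch skips a final turn whose user message is empty instead of appending it.

```python
from typing import Dict, List, Sequence


def history_to_messages(history: Sequence[Sequence[str]]) -> List[Dict[str, str]]:
    """Hypothetical helper: flatten [user_msg, model_msg] pairs into chat messages."""
    messages = []
    for user_msg, model_msg in history:
        if user_msg:
            messages.append({"role": "user", "content": user_msg})
        if model_msg:
            # An empty model_msg marks the turn that is still waiting for a reply.
            messages.append({"role": "assistant", "content": model_msg})
    return messages


if __name__ == "__main__":
    history = [["hi", "hello"], ["who are you?", ""]]
    for message in history_to_messages(history):
        print(message)
```

The resulting list has the shape that `tokenizer.apply_chat_template` receives in the script above.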
--- a/LICENSE
+++ b/LICENSE
+                                 Apache License
+                           Version 2.0, January 2004
+                        http://www.apache.org/licenses/
+
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+   1. Definitions.
+
+      "License" shall mean the terms and conditions for use, reproduction,
+      and distribution as defined by Sections 1 through 9 of this document.
+
+      "Licensor" shall mean the copyright owner or entity authorized by
+      the copyright owner that is granting the License.
+
+      "Legal Entity" shall mean the union of the acting entity and all
+      other entities that control, are controlled by, or are under common
+      control with that entity. For the purposes of this definition,
+      "control" means (i) the power, direct or indirect, to cause the
+      direction or management of such entity, whether by contract or
+      otherwise, or (ii) ownership of fifty percent (50%) or more of the
+      outstanding shares, or (iii) beneficial ownership of such entity.
+
+      "You" (or "Your") shall mean an individual or Legal Entity
+      exercising permissions granted by this License.
+
+      "Source" form shall mean the preferred form for making modifications,
+      including but not limited to software source code, documentation
+      source, and configuration files.
+
+      "Object" form shall mean any form resulting from mechanical
+      transformation or translation of a Source form, including but
+      not limited to compiled object code, generated documentation,
+      and conversions to other media types.
+
+      "Work" shall mean the work of authorship, whether in Source or
+      Object form, made available under the License, as indicated by a
+      copyright notice that is included in or attached to the work
+      (an example is provided in the Appendix below).
+
+      "Derivative Works" shall mean any work, whether in Source or Object
+      form, that is based on (or derived from) the Work and for which the
+      editorial revisions, annotations, elaborations, or other modifications
+      represent, as a whole, an original work of authorship. For the purposes
+      of this License, Derivative Works shall not include works that remain
+      separable from, or merely link (or bind by name) to the interfaces of,
+      the Work and Derivative Works thereof.
+
+      "Contribution" shall mean any work of authorship, including
+      the original version of the Work and any modifications or additions
+      to that Work or Derivative Works thereof, that is intentionally
+      submitted to Licensor for inclusion in the Work by the copyright owner
+      or by an individual or Legal Entity authorized to submit on behalf of
+      the copyright owner. For the purposes of this definition, "submitted"
+      means any form of electronic, verbal, or written communication sent
+      to the Licensor or its representatives, including but not limited to
+      communication on electronic mailing lists, source code control systems,
+      and issue tracking systems that are managed by, or on behalf of, the
+      Licensor for the purpose of discussing and improving the Work, but
+      excluding communication that is conspicuously marked or otherwise
+      designated in writing by the copyright owner as "Not a Contribution."
+
+      "Contributor" shall mean Licensor and any individual or Legal Entity
+      on behalf of whom a Contribution has been received by Licensor and
+      subsequently incorporated within the Work.
+
+   2. Grant of Copyright License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      copyright license to reproduce, prepare Derivative Works of,
+      publicly display, publicly perform, sublicense, and distribute the
+      Work and such Derivative Works in Source or Object form.
+
+   3. Grant of Patent License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      (except as stated in this section) patent license to make, have made,
+      use, offer to sell, sell, import, and otherwise transfer the Work,
+      where such license applies only to those patent claims licensable
+      by such Contributor that are necessarily infringed by their
+      Contribution(s) alone or by combination of their Contribution(s)
+      with the Work to which such Contribution(s) was submitted. If You
+      institute patent litigation against any entity (including a
+      cross-claim or counterclaim in a lawsuit) alleging that the Work
+      or a Contribution incorporated within the Work constitutes direct
+      or contributory patent infringement, then any patent licenses
+      granted to You under this License for that Work shall terminate
+      as of the date such litigation is filed.
+
+   4. Redistribution. You may reproduce and distribute copies of the
+      Work or Derivative Works thereof in any medium, with or without
+      modifications, and in Source or Object form, provided that You
+      meet the following conditions:
+
+      (a) You must give any other recipients of the Work or
+          Derivative Works a copy of this License; and
+
+      (b) You must cause any modified files to carry prominent notices
+          stating that You changed the files; and
+
+      (c) You must retain, in the Source form of any Derivative Works
+          that You distribute, all copyright, patent, trademark, and
+          attribution notices from the Source form of the Work,
+          excluding those notices that do not pertain to any part of
+          the Derivative Works; and
+
+      (d) If the Work includes a "NOTICE" text file as part of its
+          distribution, then any Derivative Works that You distribute must
+          include a readable copy of the attribution notices contained
+          within such NOTICE file, excluding those notices that do not
+          pertain to any part of the Derivative Works, in at least one
+          of the following places: within a NOTICE text file distributed
+          as part of the Derivative Works; within the Source form or
+          documentation, if provided along with the Derivative Works; or,
+          within a display generated by the Derivative Works, if and
+          wherever such third-party notices normally appear. The contents
+          of the NOTICE file are for informational purposes only and
+          do not modify the License. You may add Your own attribution
+          notices within Derivative Works that You distribute, alongside
+          or as an addendum to the NOTICE text from the Work, provided
+          that such additional attribution notices cannot be construed
+          as modifying the License.
+
+      You may add Your own copyright statement to Your modifications and
+      may provide additional or different license terms and conditions
+      for use, reproduction, or distribution of Your modifications, or
+      for any such Derivative Works as a whole, provided Your use,
+      reproduction, and distribution of the Work otherwise complies with
+      the conditions stated in this License.
+
+   5. Submission of Contributions. Unless You explicitly state otherwise,
+      any Contribution intentionally submitted for inclusion in the Work
+      by You to the Licensor shall be under the terms and conditions of
+      this License, without any additional terms or conditions.
+      Notwithstanding the above, nothing herein shall supersede or modify
+      the terms of any separate license agreement you may have executed
+      with Licensor regarding such Contributions.
+
+   6. Trademarks. This License does not grant permission to use the trade
+      names, trademarks, service marks, or product names of the Licensor,
+      except as required for reasonable and customary use in describing the
+      origin of the Work and reproducing the content of the NOTICE file.
+
+   7. Disclaimer of Warranty. Unless required by applicable law or
+      agreed to in writing, Licensor provides the Work (and each
+      Contributor provides its Contributions) on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied, including, without limitation, any warranties or conditions
+      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+      PARTICULAR PURPOSE. You are solely responsible for determining the
+      appropriateness of using or redistributing the Work and assume any
+      risks associated with Your exercise of permissions under this License.
+
+   8. Limitation of Liability. In no event and under no legal theory,
+      whether in tort (including negligence), contract, or otherwise,
+      unless required by applicable law (such as deliberate and grossly
+      negligent acts) or agreed to in writing, shall any Contributor be
+      liable to You for damages, including any direct, indirect, special,
+      incidental, or consequential damages of any character arising as a
+      result of this License or out of the use or inability to use the
+      Work (including but not limited to damages for loss of goodwill,
+      work stoppage, computer failure or malfunction, or any and all
+      other commercial damages or losses), even if such Contributor
+      has been advised of the possibility of such damages.
+
+   9. Accepting Warranty or Additional Liability. While redistributing
+      the Work or Derivative Works thereof, You may choose to offer,
+      and charge a fee for, acceptance of support, warranty, indemnity,
+      or other liability obligations and/or rights consistent with this
+      License. However, in accepting such obligations, You may act only
+      on Your own behalf and on Your sole responsibility, not on behalf
+      of any other Contributor, and only if You agree to indemnify,
+      defend, and hold each Contributor harmless for any liability
+      incurred by, or claims asserted against, such Contributor by reason
+      of your accepting any such warranty or additional liability.
+
+   END OF TERMS AND CONDITIONS
+
+   APPENDIX: How to apply the Apache License to your work.
+
+      To apply the Apache License to your work, attach the following
+      boilerplate notice, with the fields enclosed by brackets "[]"
+      replaced with your own identifying information. (Don't include
+      the brackets!)  The text should be enclosed in the appropriate
+      comment syntax for the file format. We also recommend that a
+      file or class name and description of purpose be included on the
+      same "printed page" as the copyright notice for easier
+      identification within third-party archives.
+
+   Copyright 2024 ChatGLM team @ Zhipu AI
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
--- a/MODEL_LICENSE
+++ b/MODEL_LICENSE
+The ChatGLM3-6B License
+
+1. 定义
+
+“许可方”是指分发其软件的 ChatGLM3-6B 模型团队。
+
+“软件”是指根据本许可提供的 ChatGLM3-6B 模型参数。
+
+2. 许可授予
+
+根据本许可的条款和条件，许可方特此授予您非排他性、全球性、不可转让、不可再许可、可撤销、免版税的版权许可。
+
+上述版权声明和本许可声明应包含在本软件的所有副本或重要部分中。
+
+3.限制
+
+您不得出于任何军事或非法目的使用、复制、修改、合并、发布、分发、复制或创建本软件的全部或部分衍生作品。
+
+您不得利用本软件从事任何危害国家安全和国家统一、危害社会公共利益、侵犯人身权益的行为。
+
+4.免责声明
+
+本软件“按原样”提供，不提供任何明示或暗示的保证，包括但不限于对适销性、特定用途的适用性和非侵权性的保证。 在任何情况下，作者或版权持有人均不对任何索赔、损害或其他责任负责，无论是在合同诉讼、侵权行为还是其他方面，由软件或软件的使用或其他交易引起、由软件引起或与之相关 软件。
+
+5. 责任限制
+
+除适用法律禁止的范围外，在任何情况下且根据任何法律理论，无论是基于侵权行为、疏忽、合同、责任或其他原因，任何许可方均不对您承担任何直接、间接、特殊、偶然、示范性、 或间接损害，或任何其他商业损失，即使许可人已被告知此类损害的可能性。
+
+6.争议解决
+
+本许可受中华人民共和国法律管辖并按其解释。 因本许可引起的或与本许可有关的任何争议应提交北京市海淀区人民法院。
+
+请注意，许可证可能会更新到更全面的版本。 有关许可和版权的任何问题，请通过 license@zhipuai.cn 与我们联系。
+
+1. Definitions
+
+“Licensor” means the ChatGLM3-6B Model Team that distributes its Software.
+
+“Software” means the ChatGLM3-6B model parameters made available under this license.
+
+2. License Grant
+
+Subject to the terms and conditions of this License, the Licensor hereby grants to you a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable, royalty-free copyright license to use the Software.
+
+The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+
+3. Restriction
+
+You will not use, copy, modify, merge, publish, distribute, reproduce, or create derivative works of the Software, in whole or in part, for any military, or illegal purposes.
+
+You will not use the Software for any act that may undermine China's national security and national unity, harm the public interest of society, or infringe upon the rights and interests of human beings.
+
+4. Disclaimer
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
+5. Limitation of Liability
+
+EXCEPT TO THE EXTENT PROHIBITED BY APPLICABLE LAW, IN NO EVENT AND UNDER NO LEGAL THEORY, WHETHER BASED IN TORT, NEGLIGENCE, CONTRACT, LIABILITY, OR OTHERWISE WILL ANY LICENSOR BE LIABLE TO YOU FOR ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES, OR ANY OTHER COMMERCIAL LOSSES, EVEN IF THE LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
+
+6. Dispute Resolution
+
+This license shall be governed and construed in accordance with the laws of People’s Republic of China. Any dispute arising from or in connection with this License shall be submitted to Haidian District People's Court in Beijing.
+
+Note that the license is subject to update to a more comprehensive version.  For any questions related to the license and copyright, please contact us at license@zhipuai.cn.
--- a/PROMPT.md
+++ b/PROMPT.md
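+## ChatGLM3 Chat Format
+To prevent injection attacks through user input and to unify the input format for tasks such as Code Interpreter and Tool & Agent, ChatGLM3 adopts a new chat format.
+
+### Specification
+#### Overall Structure
+A ChatGLM3 conversation consists of a sequence of messages, each made up of a chat header and content. A typical multi-turn conversation is structured as follows: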
+```text
+<|system|>
+You are ChatGLM3, a large language model trained by Zhipu.AI. Follow the user's instructions carefully. Respond using markdown.
+<|user|>
+Hello
+<|assistant|>
+Hello, I'm ChatGLM3. What can I assist you today?
+```
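+**In practice, the content of each turn does not necessarily end with a newline; the line breaks here are only for readability. The same applies below.**
+
+#### Chat Header
+The chat header occupies a full line, in the following format: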
+```text
+<|role|>{metadata}
+```
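+The `<|role|>` part is a special token that the tokenizer cannot produce from plain text, which prevents injection. The metadata part is plain text and is optional.
+* `<|system|>`: system information. By design it may appear anywhere in the conversation, **but currently it may only appear at the beginning**
+* `<|user|>`: the user
+  - Messages from `<|user|>` never appear consecutively
+* `<|assistant|>`: the AI assistant
+  - Must be preceded by a message from `<|user|>`
+* `<|observation|>`: the result returned from an external source
+  - Must come after a message from `<|assistant|>`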
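+### Example Scenarios
+
+For readability, an extra newline is added before each role special token in the examples below. This extra newline is not needed in actual use or in the tokenizer implementation.
+
+#### Multi-turn Dialogue
+* Only three roles appear: `<|user|>`, `<|assistant|>`, and `<|system|>`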
+```text
+<|system|>
+You are ChatGLM3, a large language model trained by Zhipu.AI. Follow the user's instructions carefully. Respond using markdown.
+<|user|>
+Hello
+<|assistant|>
+Hello, I'm ChatGLM3. What can I assist you today?
+```
+
+#### Tool Calling
+````
+<|system|>
+Answer the following questions as best as you can. You have access to the following tools:
+[
+    {
+        "name": "get_current_weather",
+        "description": "Get the current weather in a given location",
+        "parameters": {
+            "type": "object",
+            "properties": {
+                "location": {
+                    "type": "string",
+                    "description": "The city and state, e.g. San Francisco, CA",
+                },
+                "unit": {"type": "string"},
+            },
+            "required": ["location"],
+        },
+    }
+]
+<|user|>
+今天北京的天气怎么样？
+<|assistant|>
+好的，让我们来查看今天的天气
+<|assistant|>get_current_weather
+```python
+tool_call(location="beijing", unit="celsius")
+```
+<|observation|>
+{"temperature": 22}
+<|assistant|>
+根据查询结果，今天北京的气温为 22 摄氏度。
+````
+
+#### Code Execution
+* Four roles appear: `<|user|>`, `<|assistant|>`, `<|system|>`, and `<|observation|>`. The only metadata value for `<|assistant|>` is `interpreter`.
+
+`````text
+<|system|>
+你是一位智能AI助手，你叫ChatGLM3，你连接着一台电脑，但请注意不能联网。在使用Python解决任务时，你可以运行代码并得到结果，如果运行结果有错误，你需要尽可能对代码进行改进。你可以处理用户上传到电脑上的文件，文件默认存储路径是/mnt/data/。
+<|user|>
+#File: /mnt/data/metadata.jsonl
+#Size: 35380
+#File uploaded
+文件中是否存在缺失值或异常值？
+<|assistant|>
+首先，我将读取您提供的文件，并查看其内容以确定是否存在缺失值或异常值。
+
+我们从加载文件开始：
+<|assistant|>interpreter
+```python
+import json
+
+# Load the uploaded file
+data = []
+with open("/mnt/data/metadata.jsonl", 'r') as file:
+    for line in file:
+        data.append(json.loads(line))
+
+# Display the first few records to get a sense of the data
+data[:5]
+```
+<|observation|>
+```result
+[{'file_name': 'aceinthehole.png',
+  'name': 'Ace in the Hole',
+  'type': 'survivor',
+  'description': 'Lady Luck always seems to be throwing something good your way.'},
+ {'file_name': 'adrenaline.png',
+  'name': 'Adrenaline',
+  'type': 'survivor',
+  'description': 'You are fuelled by unexpected energy when on the verge of escape.'},
+ {'file_name': 'aftercare.png',
+  'name': 'Aftercare',
+  'type': 'survivor',
+  'description': "Unlocks potential in one's aura reading ability."},
+ {'file_name': 'agitation.png',
+  'name': 'Agitation',
+  'type': 'killer',
+  'description': 'You get excited in anticipation of hooking your prey.'},
+ {'file_name': 'alert.png',
+  'name': 'Alert',
+  'type': 'survivor',
+  'description': 'Your acute senses are on high alert.'}]
+```
+<|assistant|>
+该文件看起来包含有关某些条目的元数据，每个条目有以下字段：
+
+
+- `file_name`: 文件名称
+- `name`: 名称
+- `type`: 类型（例如 "survivor" 或 "killer"）
+- `description`: 描述
+
+接下来，我将检查这些字段是否存在缺失值。
+<|assistant|>interpreter
+```python
+# Check for missing values in each column
+missing_values = {key: sum(1 for item in data if key not in item) for key in data[0].keys()}
+
+missing_values
+```
+<|observation|>
+```result
+{'file_name': 0, 'name': 0, 'type': 0, 'description': 0}
+```
+<|assistant|>
+每个字段都没有缺失值。这意味着每个条目都完整。
+
+接下来，我们可以检查是否存在任何潜在的异常值。这通常需要更深入的分析和对数据的了解，但我可以从一些基本的统计和分布开始进行检查。为了这样做，我们可以查看“type”字段的值分布，看看是否有任何意外的类别。
+<|assistant|>interpreter
+```python
+# Check the distribution of values in the 'type' column
+type_distribution = {key: sum(1 for item in data if item['type'] == key) for key in set(item['type'] for item in data)}
+
+type_distribution
+```
+<|observation|>
+```result
+{'survivor': 116, 'killer': 103}
+```
+<|assistant|>
+“type”字段有两个值：“survivor”和“killer”。其中，“survivor”有116个条目，而“killer”有103个条目。
+
+从这个初步的检查中，我们没有发现任何明显的异常值。
+
+如果您有任何特定的标准或其他字段需要检查，我可以为您提供更详细的分析。您还希望我检查什么呢？
+<|user|>
+帮我画一个爱心
+<|assistant|>interpreter
+```python
+import numpy as np
+import matplotlib.pyplot as plt
+
+# 爱心形状的参数方程
+def heart(t):
+    x = 16 * np.sin(t) ** 3
+    y = 13 * np.cos(t) - 5 * np.cos(2 * t) - 2 * np.cos(3 * t) - np.cos(4 * t)
+    return x, y
+
+t = np.linspace(0, 2 * np.pi, 1000)
+x, y = heart(t)
+
+plt.figure(figsize=(6, 6))
+plt.plot(x, y, color='red')
+plt.axis('equal')
+plt.axis('off')
+plt.show()
+```
+<|observation|>
+```result
+[Image]
+```
+<|assistant|>
+这是一个爱心形状。我使用了参数方程来描述这个形状，并使用matplotlib进行了绘制。如果您有任何其他需求或问题，请随时告诉我。
+<|user|>  # End
+`````
\ No newline at end of file
--- a/PROMPT_en.md
+++ b/PROMPT_en.md
+## ChatGLM3 Chat Format
+To avoid injection attacks from user input, and to unify the input of Code Interpreter, Tool & Agent and other tasks, ChatGLM3 adopts a brand-new dialogue format.
+
+### Regulations
+#### Overall Structure
+The ChatGLM3 chat format consists of a sequence of messages, each of which contains a chat header and content. A typical multi-turn dialogue structure is as follows:
+```text
+<|system|>
+You are ChatGLM3, a large language model trained by Zhipu.AI. Follow the user's instructions carefully. Respond using markdown.
+<|user|>
+Hello
+<|assistant|>
+Hello, I'm ChatGLM3. What can I assist you today?
+```
+
+#### Chat Header
+The chat header occupies a complete line, formatted as:
+```text
+<|role|>{metadata}
+```
+The `<|role|>` part is a special token that the tokenizer cannot produce by encoding plain text, which prevents injection attacks. The `metadata` part is plain text and is optional.
+* `<|system|>`: System information. By design it can be interspersed in the dialogue, **but currently it may only appear at the beginning**
+* `<|user|>`: User
+  - Messages from `<|user|>` never appear consecutively
+* `<|assistant|>`: AI assistant
+  - There must be a message from `<|user|>` before it appears
+* `<|observation|>`: External return result
+  - Must be after the message from `<|assistant|>`
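
The ordering rules above can be checked mechanically. The following is a minimal sketch under stated assumptions, not the ChatGLM3 tokenizer: the `(role, metadata, content)` tuple layout and the names `validate_roles` and `render` are hypothetical, role tokens are rendered as plain text purely for illustration (the real format uses special tokens), and "after `<|assistant|>`" is read here as "immediately after".

```python
from typing import List, Tuple

Message = Tuple[str, str, str]  # (role, metadata, content); hypothetical layout

ROLES = {"system", "user", "assistant", "observation"}


def validate_roles(messages: List[Message]) -> None:
    """Raise ValueError if the role sequence breaks the rules listed above."""
    seen_user = False
    prev_role = None
    for index, (role, _metadata, _content) in enumerate(messages):
        if role not in ROLES:
            raise ValueError(f"unknown role: {role!r}")
        if role == "system" and index != 0:
            raise ValueError("<|system|> may only appear at the beginning")
        if role == "user" and prev_role == "user":
            raise ValueError("consecutive <|user|> messages are not allowed")
        if role == "assistant" and not seen_user:
            raise ValueError("<|assistant|> needs an earlier <|user|> message")
        # Assumption: "after" is interpreted as "immediately after".
        if role == "observation" and prev_role != "assistant":
            raise ValueError("<|observation|> must follow <|assistant|>")
        seen_user = seen_user or role == "user"
        prev_role = role


def render(messages: List[Message]) -> str:
    """Render messages as text, one `<|role|>{metadata}` header line per message."""
    validate_roles(messages)
    return "\n".join(
        f"<|{role}|>{metadata}\n{content}" for role, metadata, content in messages
    )
```

Because the header is `<|role|>{metadata}`, an assistant message with metadata `interpreter` renders as the header line `<|assistant|>interpreter`, matching the examples in this file.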
+
+### Example Scenarios
+
+For better readability, an extra `\n` is added before each role special token. This extra `\n` should not be added in actual use and tokenizer implementation.
+
+#### Multi-turn Dialogue
+* There are only three roles: `<|user|>`, `<|assistant|>`, and `<|system|>`.
+```text
+<|system|>
+You are ChatGLM3, a large language model trained by Zhipu.AI. Follow the user's instructions carefully. Respond using markdown.
+<|user|>
+Hello
+<|assistant|>
+Hello, I'm ChatGLM3. What can I assist you today?
+```
+
+#### Tool Calling
+````
+<|system|>
+Answer the following questions as best as you can. You have access to the following tools:
+[
+    {
+        "name": "get_current_weather",
+        "description": "Get the current weather in a given location",
+        "parameters": {
+            "type": "object",
+            "properties": {
+                "location": {
+                    "type": "string",
+                    "description": "The city and state, e.g. San Francisco, CA",
+                },
+                "unit": {"type": "string"},
+            },
+            "required": ["location"],
+        },
+    }
+]
+<|user|>
+What's the weather in Beijing today?
+<|assistant|>
+Okay, let's look up the weather in Beijing today.
+<|assistant|>get_current_weather
+```python
+tool_call(location="beijing", unit="celsius")
+```
+<|observation|>
+{"temperature": 22}
+<|assistant|>
+According to the query results, the temperature in Beijing today is 22 degrees Celsius.
+````
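
In the example above, the assistant emits a `tool_call(...)` expression whose keyword arguments must be handed to the tool named in the header metadata. One way to extract them without executing model-generated code is to parse the expression with Python's `ast` module. This is a hedged sketch, not necessarily how the demos in this repository do it; the helper name `parse_tool_call` is hypothetical.

```python
import ast
from typing import Any, Dict


def parse_tool_call(code: str) -> Dict[str, Any]:
    """Extract keyword arguments from a `tool_call(...)` line without executing it."""
    tree = ast.parse(code.strip(), mode="eval")
    call = tree.body
    if not (
        isinstance(call, ast.Call)
        and isinstance(call.func, ast.Name)
        and call.func.id == "tool_call"
    ):
        raise ValueError("expected a single tool_call(...) expression")
    if call.args or any(kw.arg is None for kw in call.keywords):
        raise ValueError("only plain keyword arguments are supported in this sketch")
    # literal_eval accepts only Python literals, so no arbitrary code runs.
    return {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}


print(parse_tool_call('tool_call(location="beijing", unit="celsius")'))
```

The resulting dictionary can then be passed to whichever function is registered under the metadata name (`get_current_weather` in the example), and its return value sent back as the `<|observation|>` content.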
+
+#### Code Execution
+* Four roles are available: `<|user|>`, `<|assistant|>`, `<|system|>`, and `<|observation|>`. The only `metadata` value allowed for `<|assistant|>` is `interpreter`.
+
+`````text
+<|system|>
+You are an intelligent AI assistant, named ChatGLM3. You are connected to a computer, but please note that you cannot connect to the internet. When using Python to solve tasks, you can run the code and get the results. If there are errors in the running results, you need to improve the code as much as possible. You can process files uploaded by users to your computer, and the default storage path for files is /mnt/data/.
+<|user|>
+#File: /mnt/data/metadata.jsonl
+#Size: 35380
+#File uploaded
+Are there missing or abnormal values in the file?
+<|assistant|>
+Firstly, I will read the file you provided and review its contents to determine if there are missing or abnormal values.
+
+We start by loading the file:
+<|assistant|>interpreter
+```python
+import json
+
+# Load the uploaded file
+data = []
+with open("/mnt/data/metadata.jsonl", 'r') as file:
+    for line in file:
+        data.append(json.loads(line))
+
+# Display the first few records to get a sense of the data
+data[:5]
+```
+<|observation|>
+```result
+[{'file_name': 'aceinthehole.png',
+  'name': 'Ace in the Hole',
+  'type': 'survivor',
+  'description': 'Lady Luck always seems to be throwing something good your way.'},
+ {'file_name': 'adrenaline.png',
+  'name': 'Adrenaline',
+  'type': 'survivor',
+  'description': 'You are fuelled by unexpected energy when on the verge of escape.'},
+ {'file_name': 'aftercare.png',
+  'name': 'Aftercare',
+  'type': 'survivor',
+  'description': "Unlocks potential in one's aura reading ability."},
+ {'file_name': 'agitation.png',
+  'name': 'Agitation',
+  'type': 'killer',
+  'description': 'You get excited in anticipation of hooking your prey.'},
+ {'file_name': 'alert.png',
+  'name': 'Alert',
+  'type': 'survivor',
+  'description': 'Your acute senses are on high alert.'}]
+```
+<|assistant|>
+This file appears to contain metadata about certain items, each with the following fields:
+
+- `file_name`
+- `name`
+- `type` (e.g. "survivor" or "killer")
+- `description`
+
+Next, I will check for missing values in these fields.
+<|assistant|>interpreter
+```python
+# Check for missing values in each column
+missing_values = {key: sum(1 for item in data if key not in item) for key in data[0].keys()}
+
+missing_values
+```
+<|observation|>
+```result
+{'file_name': 0, 'name': 0, 'type': 0, 'description': 0}
+```
+<|assistant|>
+Each field has no missing values. This means that each entry is complete.
+
+Next, we can check for any potential outliers. This usually requires deeper analysis and understanding of the data, but I can start by examining some basic statistics and distributions. To do this, we can check the value distribution of the "type" field to see if there are any unexpected categories.
+<|assistant|>interpreter
+```python
+# Check the distribution of values in the 'type' column
+type_distribution = {key: sum(1 for item in data if item['type'] == key) for key in set(item['type'] for item in data)}
+
+type_distribution
+```
+<|observation|>
+```result
+{'survivor': 116, 'killer': 103}
+```
+<|assistant|>
+The 'type' field has two values: 'survivor' and 'killer'. Among them, 'survivor' has 116 entries, while 'killer' has 103 entries.
+
+From this preliminary examination, we did not find any obvious abnormal values.
+
+If you have any specific standards or other fields that need to be checked, I can provide you with a more detailed analysis. What else do you want me to check?
+<|user|>
+Help me draw a heart
+<|assistant|>interpreter
+```python
+import numpy as np
+import matplotlib.pyplot as plt
+
+# The parametric equation for the shape of a heart
+def heart(t):
+    x = 16 * np.sin(t) ** 3
+    y = 13 * np.cos(t) - 5 * np.cos(2 * t) - 2 * np.cos(3 * t) - np.cos(4 * t)
+    return x, y
+
+t = np.linspace(0, 2 * np.pi, 1000)
+x, y = heart(t)
+
+plt.figure(figsize=(6, 6))
+plt.plot(x, y, color='red')
+plt.axis('equal')
+plt.axis('off')
+plt.show()
+```
+<|observation|>
+```result
+[Image]
+```
+<|assistant|>
+This is a shape of heart. I used parametric equations to describe this shape and drew it using Matplotlib. If you have any other needs or questions, please feel free to let me know.
+<|user|>  # End
+`````
--- a/README.md
+++ b/README.md
+# ChatGLM3
+
+<p align="center">
+🤗 <a href="https://huggingface.co/THUDM/chatglm3-6b" target="_blank">HF Repo</a> • 🤖 <a href="https://modelscope.cn/models/ZhipuAI/chatglm3-6b" target="_blank">ModelScope</a> • 🤖 <a href="https://www.wisemodel.cn/models/ZhipuAI/chatglm3-6b" target="_blank">WiseModel</a> • 🧰 <a href="https://openxlab.org.cn/models/hot/THUDM" target="_blank">OpenXLab</a> • 🐦 <a href="https://twitter.com/thukeg" target="_blank">Twitter</a> • 📃 <a href="https://arxiv.org/abs/2103.10360" target="_blank">[GLM@ACL 22]</a> <a href="https://github.com/THUDM/GLM" target="_blank">[GitHub]</a> • 📃 <a href="https://arxiv.org/abs/2210.02414" target="_blank">[GLM-130B@ICLR 23]</a> <a href="https://github.com/THUDM/GLM-130B" target="_blank">[GitHub]</a> <br>
+</p>
+<p align="center">
+    👋 Join our <a href="https://join.slack.com/t/chatglm/shared_invite/zt-25ti5uohv-A_hs~am_D3Q8XPZMpj7wwQ" target="_blank">Slack</a> and <a href="resources/WECHAT.md" target="_blank">WeChat</a>
+</p>
+<p align="center">
+📍Experience larger-scale ChatGLM models at <a href="https://www.chatglm.cn">chatglm.cn</a>.
+</p>
+
+[Read this in English.](./README_en.md)
+
+📔 For more detailed usage information about `ChatGLM3-6B`, see:
+
+* [ChatGLM3 open technical documentation](https://lslfd0slxc.feishu.cn/wiki/WvQbwIJ9tiPAxGk8ywDck6yfnof?from=from_copylink)
+* [Bilibili video](https://www.bilibili.com/video/BV1uC4y1J7yA)
+* [YouTube video](https://www.youtube.com/watch?v=Pw9PB6R7ORA)
+
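+## Introduction to GLM-4
+
+We have released the latest **GLM-4** model, which achieves new breakthroughs on multiple benchmarks. You can try it through the following channels.
+
+* [Zhipu Qingyan (智谱清言)](https://www.chatglm.cn): try the latest GLM-4, including features such as **GLMs** and **All tools**, by downloading the Zhipu Qingyan app
+  or using the [web version](https://www.chatglm.cn).
+* [API platform](https://open.bigmodel.cn/): the new-generation API platform is online. You can try new models such as `GLM-4`, `GLM-3-Turbo`, `CharacterGLM-3`, and `CogView-3` directly on the API
+  platform.
+  Of these, `GLM-4` and `GLM-3-Turbo` support new features such as `System Prompt`, `Function Call`, `Retrieval`, and `Web_Search`.
+* [GLM-4 API open-source tutorial](https://github.com/MetaGLM/glm-cookbook/): GLM-4 API tutorials and basic applications.
+  Questions about the API can be raised in that tutorial repository, or you can use the [GLM-4 API AI assistant](https://open.bigmodel.cn/shareapp/v1/?share_code=sQwt5qyqYVaNh1O_87p8O)
+  for help with common questions.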
+
+-----
+
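+## Introduction to ChatGLM3
+
+**ChatGLM3** is a pre-trained dialogue model jointly released by Zhipu AI and the KEG Lab of Tsinghua University. ChatGLM3-6B is the open-source model in the ChatGLM3
+series. While retaining many strengths of the previous two generations, such as fluent dialogue and a low deployment threshold, ChatGLM3-6B introduces the following features:
+
+1. **A stronger base model:** ChatGLM3-6B-Base, the base model of ChatGLM3-6B,
+   uses more diverse training data, more training steps, and a more reasonable training strategy. Evaluations on datasets covering semantics, mathematics, reasoning, code, and knowledge show that
+   **ChatGLM3-6B-Base has the strongest performance among base models under 10B parameters**.
+2. **More complete feature support:** ChatGLM3-6B adopts a newly designed [prompt format](PROMPT.md).
+   In addition to normal multi-turn dialogue, it natively supports complex scenarios such as [tool calling](tools_using_demo/README.md) (Function Call), code execution (Code Interpreter), and
+   Agent tasks.
+3. **A more comprehensive open-source series:** in addition to the dialogue model [ChatGLM3-6B](https://huggingface.co/THUDM/chatglm3-6b),
+   the base model [ChatGLM3-6B-Base](https://huggingface.co/THUDM/chatglm3-6b-base),
+   the long-text dialogue model [ChatGLM3-6B-32K](https://huggingface.co/THUDM/chatglm3-6b-32k),
+   and [ChatGLM3-6B-128K](https://huggingface.co/THUDM/chatglm3-6b-128k), which further strengthens long-text understanding, are also open source. All of these weights are **fully open** for academic research,
+   and **free commercial use is also permitted** after registering via the [questionnaire](https://open.bigmodel.cn/mla/form).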
+
+-----
+
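+The ChatGLM3 open-source models are intended to advance large-model technology together with the open-source community. Developers and all users are asked to comply with the [open-source license](MODEL_LICENSE)
+and not to use the open-source models, the code, or derivatives of this open-source project for any purpose that may harm the nation or society, or for any service that has not undergone safety assessment and filing. At present, the project team has not developed any application based on the
+**ChatGLM3 open-source models**, including web, Android, Apple iOS, and Windows apps.
+
+Although every effort is made to ensure the compliance and accuracy of the data at each stage of training, the ChatGLM3-6B
+model is relatively small and is subject to probabilistic randomness, so the accuracy of its output cannot be guaranteed. The model's output is also easily misled by user input.
+**This project assumes no responsibility for data security or public opinion risks caused by the open-source models and code, or for any risks and liabilities arising from the models being misled, abused, disseminated, or improperly exploited.**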
+
+## Model List
+
+|      Model       | Seq Length |                                                                                                                                              Download                                                                                                                                    |
+|:----------------:|:----------:|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
+|   ChatGLM3-6B    |     8k     |        [HuggingFace](https://huggingface.co/THUDM/chatglm3-6b) \| [ModelScope](https://modelscope.cn/models/ZhipuAI/chatglm3-6b) \| [WiseModel](https://www.wisemodel.cn/models/ZhipuAI/chatglm3-6b)         \| [OpenXLab](https://openxlab.org.cn/models/detail/THUDM/chatglm3-6b)      |
+| ChatGLM3-6B-Base |     8k     | [HuggingFace](https://huggingface.co/THUDM/chatglm3-6b-base) \| [ModelScope](https://modelscope.cn/models/ZhipuAI/chatglm3-6b-base) \| [WiseModel](https://www.wisemodel.cn/models/ZhipuAI/chatglm3-6b-base) \| [OpenXLab](https://openxlab.org.cn/models/detail/THUDM/chatglm3-6b-base) |
+| ChatGLM3-6B-32K  |    32k     |  [HuggingFace](https://huggingface.co/THUDM/chatglm3-6b-32k) \| [ModelScope](https://modelscope.cn/models/ZhipuAI/chatglm3-6b-32k) \| [WiseModel](https://www.wisemodel.cn/models/ZhipuAI/chatglm3-6b-32k)   \| [OpenXLab](https://openxlab.org.cn/models/detail/THUDM/chatglm3-6b-32k)  |
+| ChatGLM3-6B-128K |    128k    |     [HuggingFace](https://huggingface.co/THUDM/chatglm3-6b-128k) \| [ModelScope](https://modelscope.cn/models/ZhipuAI/chatglm3-6b-128k) \| [OpenXLab](https://openxlab.org.cn/models/detail/THUDM/chatglm3-6b-128k) |
+
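+<br> Note that the latest updates to all models are released on Hugging Face first. ModelScope and WiseModel are not synchronized with Hugging Face automatically and must be updated manually by the developers, so they may lag behind
+Hugging Face for some time after an update.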
+
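+## Related Links
+
+The following open-source repositories provide in-depth support for the ChatGLM3-6B model and are worth exploring.
+
+Inference acceleration:
+
+* [chatglm.cpp](https://github.com/li-plus/chatglm.cpp): a quantized inference acceleration solution similar to llama.cpp, enabling real-time chat on a laptop
+* [ChatGLM3-TPU](https://github.com/sophgo/ChatGLM3-TPU): a TPU-accelerated inference solution that runs in real time at about 7.5 token/s on the Sophgo edge chip BM1684X (16T@FP16, 16 GB memory)
+* [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main):
+  a high-performance GPU inference acceleration solution developed by NVIDIA; see these [steps](./tensorrt_llm_demo/README.md) to deploy the ChatGLM3-6B model
+* [OpenVINO](https://github.com/openvinotoolkit):
+  a high-performance CPU and GPU inference acceleration solution developed by Intel; see these [steps](./Intel_device_demo/openvino_demo/README.md) to deploy the ChatGLM3-6B model
+
+Efficient fine-tuning:
+
+* [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory): an efficient fine-tuning framework that is easy to get started with.
+
+Application frameworks:
+
+* [LangChain-Chatchat](https://github.com/chatchat-space/Langchain-Chatchat): an open-source, offline-deployable retrieval-augmented generation (RAG) knowledge base project built on large language models such as ChatGLM and
+  application frameworks such as Langchain.
+
+* [BISHENG](https://github.com/dataelement/bisheng): an open-source platform for developing large-model applications that supports and accelerates their development and deployment.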
+
+## Evaluation Results
+
+### Typical Tasks
+
+We selected 8 typical Chinese and English datasets and evaluated the ChatGLM3-6B (base) version on them.
+
+| Model            | GSM8K | MATH | BBH  | MMLU | C-Eval | CMMLU | MBPP | AGIEval |
+|------------------|:-----:|:----:|:----:|:----:|:------:|:-----:|:----:|:-------:|
+| ChatGLM2-6B-Base | 32.4  | 6.5  | 33.7 | 47.9 |  51.7  | 50.0  |  -   |    -    |
+| Best Baseline    | 52.1  | 13.1 | 45.0 | 60.1 |  63.5  | 62.2  | 47.5 |  45.8   |
+| ChatGLM3-6B-Base | 72.3  | 25.7 | 66.1 | 61.4 |  69.0  | 67.5  | 52.4 |  53.7   |
+
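+> Best Baseline refers to the best-performing pre-trained model on the corresponding dataset, as of October 27, 2023, with fewer than 10B parameters, excluding models trained only for a single task without retaining general capabilities.
+
+> In the tests of ChatGLM3-6B-Base, BBH uses 3-shot evaluation; GSM8K and MATH, which require reasoning, use 0-shot CoT evaluation; MBPP uses 0-shot
+> generation followed by running the test cases to compute Pass@1; all other multiple-choice datasets use 0-shot evaluation.
+
+We manually evaluated ChatGLM3-6B-32K in multiple long-text application scenarios. Compared with the second-generation model, its performance improved by more than
+50% on average. The improvement is particularly significant in applications such as paper reading, document summarization, and financial report analysis. We also tested the model on the LongBench benchmark, with the results shown in the table below.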
+
+| Model           | Avg  | Summary | Single-Doc QA | Multi-Doc QA | Code | Few-shot | Synthetic |
+|-----------------|:----:|:-------:|:-------------:|:------------:|:----:|:--------:|:---------:|
+| ChatGLM2-6B-32K | 41.5 |  24.8   |     37.6      |     34.7     | 52.8 |   51.3   |   47.7    | 
+| ChatGLM3-6B-32K | 50.2 |  26.6   |     45.8      |     46.1     | 56.2 |   61.2   |    65     |
+
+## Usage
+
+### Environment Setup
+
+First, clone this repository:
+
+```shell
+git clone https://github.com/THUDM/ChatGLM3
+cd ChatGLM3
+```
+
+Then install the dependencies with pip:
+
+```shell
+pip install -r requirements.txt
+```
+
+To make sure the correct version of `torch` is installed, please follow the instructions in the [official documentation](https://pytorch.org/get-started/locally/) strictly.
+
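+### Composite Demo
+
+We provide a composite demo that integrates the following three modes. For instructions on running it, see [Composite Demo](composite_demo/README.md).
+
+- Chat: dialogue mode, in which you can chat with the model.
+- Tool: tool mode, in which the model can perform other operations through tools in addition to chatting.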
+
+<img src="resources/tool.png" width="400">
+
+- Code Interpreter: code interpreter mode, in which the model can execute code in a Jupyter environment and retrieve the results to complete complex tasks.
+
+<img src="resources/heart.png" width="400">
+
+### Code Usage
+
+The ChatGLM model can be called to generate a conversation with the following code:
+
+```python
+>>> from transformers import AutoTokenizer, AutoModel
+>>> tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
+>>> model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True, device='cuda')
+>>> model = model.eval()
+>>> response, history = model.chat(tokenizer, "你好", history=[])
+>>> print(response)
+你好👋!我是人工智能助手 ChatGLM3-6B, 很高兴见到你, 欢迎问我任何问题。
+>>> response, history = model.chat(tokenizer, "晚上睡不着应该怎么办", history=history)
+>>> print(response)
+晚上睡不着可能会让你感到焦虑或不舒服, 但以下是一些可以帮助你入睡的方法:
+
+1.制定规律的睡眠时间表: 保持规律的睡眠时间表可以帮助你建立健康的睡眠习惯, 使你更容易入睡。尽量在每天的相同时间上床, 并在同一时间起床。
+2.创造一个舒适的睡眠环境: 确保睡眠环境舒适, 安静, 黑暗且温度适宜。可以使用舒适的床上用品, 并保持房间通风。
+3.放松身心: 在睡前做些放松的活动, 例如泡个热水澡, 听些轻柔的音乐, 阅读一些有趣的书籍等, 有助于缓解紧张和焦虑, 使你更容易入睡。
+4.避免饮用含有咖啡因的饮料: 咖啡因是一种刺激性物质, 会影响你的睡眠质量。尽量避免在睡前饮用含有咖啡因的饮料, 例如咖啡, 茶和可乐。
+5.避免在床上做与睡眠无关的事情: 在床上做些与睡眠无关的事情, 例如看电影, 玩游戏或工作等, 可能会干扰你的睡眠。
+6.尝试呼吸技巧: 深呼吸是一种放松技巧, 可以帮助你缓解紧张和焦虑, 使你更容易入睡。试着慢慢吸气, 保持几秒钟, 然后缓慢呼气。
+
+如果这些方法无法帮助你入睡, 你可以考虑咨询医生或睡眠专家, 寻求进一步的建议。
+```
+
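+#### Load the Model Locally
+
+The code above makes `transformers` download the model implementation and weights automatically.
+The complete model implementation is available on [Hugging Face Hub](https://huggingface.co/THUDM/chatglm3-6b).
+If your network connection is poor, downloading the weights may take a long time or even fail. In that case, you can download the model to your local machine first and then load it from there.
+
+To download the model from Hugging Face Hub,
+first [install Git LFS](https://docs.github.com/zh/repositories/working-with-files/managing-large-files/installing-git-large-file-storage),
+then run:
+
+```shell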
+git clone https://huggingface.co/THUDM/chatglm3-6b
+```
+
+If downloading from Hugging Face is slow, you can also download the model
+from [ModelScope](https://modelscope.cn/models/ZhipuAI/chatglm3-6b).
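
The choice between a local checkout and the Hub can be wrapped in a small helper. This is a minimal sketch and not part of the repository: `resolve_model_path` and the `MODEL_PATH` environment variable are hypothetical names chosen for illustration. The returned value can be passed to `AutoTokenizer.from_pretrained` and `AutoModel.from_pretrained` in place of `"THUDM/chatglm3-6b"`.

```python
import os

HUB_ID = "THUDM/chatglm3-6b"  # model ID used in the examples above


def resolve_model_path(local_dir=None, hub_id=HUB_ID):
    """Return a local model directory if it exists, otherwise the Hub ID.

    `local_dir` and the MODEL_PATH environment variable are illustrative
    assumptions, not settings defined by this README.
    """
    candidate = local_dir or os.environ.get("MODEL_PATH")
    if candidate and os.path.isdir(candidate):
        # A cloned checkout exists: load from disk, no download needed.
        return os.path.abspath(candidate)
    # No usable local copy: let transformers download from the Hub.
    return hub_id
```

For example, after `git clone https://huggingface.co/THUDM/chatglm3-6b`, calling `resolve_model_path("./chatglm3-6b")` returns the absolute path of that directory.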
+
+### Model Fine-tuning
+
+We provide a basic kit for fine-tuning ChatGLM3-6B. For usage, see the
+[fine-tuning kit](finetune_demo/README.md).
+
+### Web-based Dialogue Demo
+
+![web-demo](resources/web-demo.gif)
+
+You can launch the Gradio-based web demo with the following command:
+
+```shell
+python web_demo_gradio.py
+```
+
+![web-demo](resources/web-demo2.png)
+
+You can launch the Streamlit-based web demo with the following command:
+
+```shell
+streamlit run web_demo_streamlit.py
+```
+
+The web demo starts a web server and prints its address. Open that address in a browser to use it. In our tests, the Streamlit-based web demo runs more smoothly.
+
+### Command-Line Dialogue Demo
+
+![cli-demo](resources/cli-demo.png)
+
+Run [cli_demo.py](basic_demo/cli_demo.py) from the repository:
+
+```shell
+cd basic_demo
+python cli_demo.py
+```
+
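+The program runs an interactive conversation in the command line. Type an instruction and press Enter to generate a reply; type `clear` to clear the conversation history, and `stop` to terminate the program.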
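
The control flow described above can be sketched as follows. This is an illustration only, not the repository's `cli_demo.py`: `run_chat_loop` and `EchoModel` are hypothetical names, and the stand-in model exists so the loop can be tried without loading ChatGLM3-6B. The only real API assumed is the `model.chat(tokenizer, query, history=...)` call shown earlier.

```python
def run_chat_loop(model, tokenizer, read_input=input, write=print):
    """Interactive loop: `clear` resets the history, `stop` exits.

    `model` is any object exposing chat(tokenizer, query, history=...)
    that returns (response, history), like the model.chat call shown earlier.
    """
    history = []
    while True:
        query = read_input("User: ").strip()
        if query == "stop":
            break
        if query == "clear":
            history = []  # drop all previous turns
            write("History cleared.")
            continue
        # Pass the accumulated history so the model sees earlier turns.
        response, history = model.chat(tokenizer, query, history=history)
        write(f"ChatGLM: {response}")
    return history


class EchoModel:
    """Stand-in for the real model so the loop can run without a GPU."""

    def chat(self, tokenizer, query, history=None):
        reply = f"echo: {query}"
        return reply, list(history or []) + [(query, reply)]
```

Calling `run_chat_loop(EchoModel(), None)` starts the loop with the stand-in; passing the real `model` and `tokenizer` uses ChatGLM3-6B instead.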
+
+### LangChain Demo
+
+For the implementation, see [LangChain Demo](langchain_demo/README.md).
+
+#### Tool Calling
+
+For how to use tool calling, see [Tool Calling](tools_using_demo/README.md).
+
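+#### OpenAI API / Zhipu API Demo
+
+We provide deployment code that serves the open-source model through an OpenAI / ZhipuAI-format API, so it can act as the backend of any ChatGPT-based application.
+Currently, you can deploy it by running [api_server.py](openai_api_demo/api_server.py) from the repository: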
+
+```shell
+cd openai_api_demo
+python api_server.py
+```
+
+We have also written sample code for testing the performance of API calls.
+
+- OpenAI test script: [openai_api_request.py](openai_api_demo/openai_api_request.py)
+- ZhipuAI test script: [zhipu_api_request.py](openai_api_demo/zhipu_api_request.py)
+
+Test with curl:
+
+chat curl test:
+
+```shell
+curl -X POST "http://127.0.0.1:8000/v1/chat/completions" \
+-H "Content-Type: application/json" \
+-d "{\"model\": \"chatglm3-6b\", \"messages\": [{\"role\": \"system\", \"content\": \"You are ChatGLM3, a large language model trained by Zhipu.AI. Follow the user's instructions carefully. Respond using markdown.\"}, {\"role\": \"user\", \"content\": \"你好，给我讲一个故事，大概100字\"}], \"stream\": false, \"max_tokens\": 100, \"temperature\": 0.8, \"top_p\": 0.8}"
+```
+
+agent-chat curl test:
+
+```shell
+curl -X POST "http://127.0.0.1:8000/v1/chat/completions" \
+-H "Content-Type: application/json" \
+-d "{\"model\": \"chatglm3-6b\", \"agent\": true, \"messages\": [{\"role\": \"user\", \"content\": \"37乘以8加7除2等于多少？\"}], \"stream\": true, \"max_tokens\": 100, \"temperature\": 0.8, \"top_p\": 0.8}"
+```
+
+Test with Python:
+
+```shell
+cd openai_api_demo
+python openai_api_request.py
+```
+
+If the test succeeds, the model should return a short story.
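
For reference, the non-streaming chat curl test can also be expressed with the Python standard library alone. This is a hedged sketch and not the repository's `openai_api_request.py`: `build_chat_request` and `send_chat_request` are hypothetical helper names, and the URL and fields are copied from the curl command above.

```python
import json
import urllib.request

# Address used in the curl tests above; adjust if the server runs elsewhere.
API_URL = "http://127.0.0.1:8000/v1/chat/completions"

SYSTEM_PROMPT = (
    "You are ChatGLM3, a large language model trained by Zhipu.AI. "
    "Follow the user's instructions carefully. Respond using markdown."
)


def build_chat_request(user_message, max_tokens=100, temperature=0.8, top_p=0.8):
    """Build the same JSON body as the non-streaming chat curl test."""
    return {
        "model": "chatglm3-6b",
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
        "stream": False,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }


def send_chat_request(payload, url=API_URL, timeout=60):
    """POST the payload as JSON and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `send_chat_request(build_chat_request("你好，给我讲一个故事，大概100字"))` sends the request. Assuming the server follows the OpenAI chat completions response format, the reply text would be under `["choices"][0]["message"]["content"]`.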
+
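+## Low-Cost Deployment
+
+### Model Quantization
+
+By default, the model is loaded in FP16 precision, and running the code above requires about 13GB of GPU memory. If your GPU memory is limited, you can try loading the model in quantized form as follows: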
+
+```python
+model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).quantize(4).cuda()
+```
+
+Quantization causes some loss in performance. In our tests, ChatGLM3-6B can still generate natural, fluent text under 4-bit quantization.
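
As a rough sanity check on these memory figures, the weights alone take about (number of parameters × bits per parameter ÷ 8) bytes. The sketch below assumes a parameter count of roughly 6.2 billion, which is an assumption and is not stated in this README. Actual usage is higher because of activations, the KV cache, and framework overhead.

```python
def estimate_weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 6.2e9  # assumed parameter count, not taken from this README

for label, bits in [("FP32", 32), ("FP16", 16), ("INT4", 4)]:
    print(f"{label}: ~{estimate_weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under that assumption, FP16 weights come to about 12.4 GB, in line with the roughly 13GB quoted above, and 4-bit weights to about 3.1 GB.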
+
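+### CPU Deployment
+
+If you do not have a GPU, you can also run inference on a CPU, although it will be slower. Load the model as follows (this requires about 32GB of memory):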
+
+```python
+model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).float()
+```
+
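+### Mac Deployment
+
+For Macs with Apple Silicon or an AMD GPU, you can use the MPS backend to run ChatGLM3-6B on the GPU. Follow Apple's
+[official instructions](https://developer.apple.com/metal/pytorch) to install PyTorch-Nightly (the correct version number should look like 2.x.x.dev2023xxxx, not
+2.x.x).
+
+Currently, macOS only supports [loading the model locally](README.md#从本地加载模型). Change the model loading in your code to load from a local path and use the mps backend: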
+
+```python
+model = AutoModel.from_pretrained("your local path", trust_remote_code=True).to('mps')
+```
+
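+Loading the half-precision ChatGLM3-6B model requires about 13GB of memory. Machines with less memory (such as a MacBook
+Pro with 16GB) fall back to virtual memory on disk when free memory runs short, which slows inference down severely.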
+
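+### Multi-GPU Deployment
+
+If you have multiple GPUs but none of them has enough memory to hold the full model, you can split the model across them. First install
+accelerate with `pip install accelerate`, then load the model as usual.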
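
One common way to do this with `transformers` and `accelerate` is to pass `device_map="auto"`, which lets accelerate place layers across the visible GPUs. The sketch below assumes this is the intended loading path; this README does not show the exact call, and `load_across_gpus` is a hypothetical helper name.

```python
def load_across_gpus(model_path="THUDM/chatglm3-6b"):
    """Load the model with its layers spread over all visible GPUs.

    Assumes `transformers` and `accelerate` are installed; the import is
    deferred so that defining this helper does not require them.
    """
    from transformers import AutoModel

    model = AutoModel.from_pretrained(
        model_path,
        trust_remote_code=True,
        device_map="auto",  # let accelerate decide the per-layer placement
    )
    return model.eval()
```

The returned model can then be used with `model.chat(...)` in the same way as the single-GPU example.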
+
+### OpenVINO Demo
+
+ChatGLM3-6B supports accelerated inference with the OpenVINO
+toolkit, which gives a substantial speedup on Intel CPU and GPU devices. For usage, see [OpenVINO Demo](Intel_device_demo/openvino_demo/README.md).
+
+### TensorRT-LLM Demo
+
+ChatGLM3-6B supports accelerated inference with the TensorRT-LLM
+toolkit, which speeds up inference several-fold. For usage, see [TensorRT-LLM Demo](tensorrt_llm_demo/tensorrt_llm_cli_demo.py)
+and the official technical documentation.
+
+## Citation
+
+If you find our work helpful, please consider citing the following papers.
+
+```
+@article{zeng2022glm,
+  title={GLM-130B: An Open Bilingual Pre-trained Model},
+  author={Zeng, Aohan and Liu, Xiao and Du, Zhengxiao and Wang, Zihan and Lai, Hanyu and Ding, Ming and Yang, Zhuoyi and Xu, Yifan and Zheng, Wendi and Xia, Xiao and others},
+  journal={arXiv preprint arXiv:2210.02414},
+  year={2022}
+}
+```
+
+```
+@inproceedings{du2022glm,
+  title={GLM: General Language Model Pretraining with Autoregressive Blank Infilling},
+  author={Du, Zhengxiao and Qian, Yujie and Liu, Xiao and Ding, Ming and Qiu, Jiezhong and Yang, Zhilin and Tang, Jie},
+  booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
+  pages={320--335},
+  year={2022}
+}
+```