Commit 599cfae1 authored by Rayyyyy's avatar Rayyyyy

Delete some codes about vllm

parent 7f9c28a1
# Basic Demo
Read this in [English](README_en.md)
In this demo you will learn how to use the GLM-4-9B open-source model for basic tasks.
Please follow the steps in this document carefully to avoid unnecessary errors.
## Device and Dependency Check
### Inference Test Data
**All data in this document were measured in the following hardware environment. Actual requirements and peak GPU memory usage will differ slightly; refer to your own environment.**
Test hardware:
+ OS: Ubuntu 22.04
+ Memory: 512GB
+ Python: 3.12.3
+ CUDA Version: 12.3
+ GPU Driver: 535.104.05
+ GPU: NVIDIA A100-SXM4-80GB * 8
Inference stress-test results are as follows:
**All tests were run on a single GPU, and all GPU memory figures are approximate peak values.**
#### GLM-4-9B-Chat
| Dtype | GPU Memory | Prefilling / First Token | Decode Speed | Remarks |
|------|----------|-----------------|------------------|--------------|
| BF16 | 19047MiB | 0.1554s | 27.8193 tokens/s | Input length is 1000 |
| BF16 | 20629MiB | 0.8199s | 31.8613 tokens/s | Input length is 8000 |
| BF16 | 27779MiB | 4.3554s | 14.4108 tokens/s | Input length is 32000 |
| BF16 | 57379MiB | 38.1467s | 3.4205 tokens/s | Input length is 128000 |

| Dtype | GPU Memory | Prefilling / First Token | Decode Speed | Remarks |
|------|----------|-----------------|------------------|-------------|
| Int4 | 8251MiB | 0.1667s | 23.3903 tokens/s | Input length is 1000 |
| Int4 | 9613MiB | 0.8629s | 23.4248 tokens/s | Input length is 8000 |
| Int4 | 16065MiB | 4.3906s | 14.6553 tokens/s | Input length is 32000 |
#### GLM-4-9B-Chat-1M
| Dtype | GPU Memory | Prefilling / First Token | Decode Speed | Remarks |
|------|----------|-----------------|------------------|--------------|
| BF16 | 74497MiB | 98.4930s | 2.3653 tokens/s | Input length is 200000 |

If your input exceeds 200K tokens, we recommend multi-GPU inference with the vLLM backend for better performance.
#### GLM-4V-9B
| Dtype | GPU Memory | Prefilling / First Token | Decode Speed | Remarks |
|------|----------|-----------------|------------------|------------|
| BF16 | 28131MiB | 0.1016s | 33.4660 tokens/s | Input length is 1000 |
| BF16 | 33043MiB | 0.7935s | 39.2444 tokens/s | Input length is 8000 |

| Dtype | GPU Memory | Prefilling / First Token | Decode Speed | Remarks |
|------|----------|-----------------|------------------|------------|
| Int4 | 10267MiB | 0.1685s | 28.7101 tokens/s | Input length is 1000 |
| Int4 | 14105MiB | 0.8629s | 24.2370 tokens/s | Input length is 8000 |
### Minimum Hardware Requirements
To run the official baseline code (transformers backend) you need:
+ Python >= 3.10
+ At least 32 GB of RAM
To run all of the code in this folder, you also need:
+ A Linux operating system (Debian-based distributions work best)
+ A GPU with more than 8 GB of memory that supports CUDA or ROCm and `BF16` inference (A100-class or newer GPUs; V100, 20-series, and older architectures are not supported)
Install dependencies:
```shell
pip install -r requirements.txt
```
## Basic Function Calls
**Unless otherwise noted, the demos in this folder do not support advanced usage such as Function Call and All Tools.**
### Using the transformers Backend
+ Chat with the GLM-4-9B model from the command line.
```shell
python trans_cli_demo.py # GLM-4-9B-Chat
python trans_cli_vision_demo.py # GLM-4V-9B
```
+ Chat with the GLM-4-9B-Chat model through a Gradio web UI.
```shell
python trans_web_demo.py
```
+ Run batch inference.
```shell
python cli_batch_request_demo.py
```
### Using the vLLM Backend
+ Chat with the GLM-4-9B-Chat model from the command line.
```shell
python vllm_cli_demo.py
```
+ Build your own server and talk to the GLM-4-9B-Chat model using `OpenAI API`-style requests. This demo supports Function Call and All Tools.
Start the server:
```shell
python openai_api_server.py
```
Send a client request:
```shell
python openai_api_request.py
```
## Stress Test
Use this script to measure the model's generation speed with the transformers backend on your own device:
```shell
python trans_stress_test.py
```
# Basic Demo
In this demo, you will experience how to use the GLM-4-9B open source model to perform basic tasks.
Please follow the steps in the document strictly to avoid unnecessary errors.
## Device and dependency check
### Related inference test data
**The data in this document were measured in the following hardware environment. Actual requirements and peak GPU memory usage will differ slightly; refer to your own environment.**
Test hardware information:
+ OS: Ubuntu 22.04
+ Memory: 512GB
+ Python: 3.12.3
+ CUDA Version: 12.3
+ GPU Driver: 535.104.05
+ GPU: NVIDIA A100-SXM4-80GB * 8
Inference stress-test results are as follows:
**All tests were run on a single GPU, and all GPU memory figures are approximate peak values.**
#### GLM-4-9B-Chat
| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
|-------|------------|------------|------------------|------------------------|
| BF16 | 19047MiB | 0.1554s | 27.8193 tokens/s | Input length is 1000 |
| BF16 | 20629MiB | 0.8199s | 31.8613 tokens/s | Input length is 8000 |
| BF16 | 27779MiB | 4.3554s | 14.4108 tokens/s | Input length is 32000 |
| BF16 | 57379MiB | 38.1467s | 3.4205 tokens/s | Input length is 128000 |

| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
|-------|------------|------------|------------------|-----------------------|
| Int4 | 8251MiB | 0.1667s | 23.3903 tokens/s | Input length is 1000 |
| Int4 | 9613MiB | 0.8629s | 23.4248 tokens/s | Input length is 8000 |
| Int4 | 16065MiB | 4.3906s | 14.6553 tokens/s | Input length is 32000 |
#### GLM-4-9B-Chat-1M
| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
|-------|------------|------------|------------------|--------------|
| BF16 | 74497MiB | 98.4930s | 2.3653 tokens/s | Input length is 200000 |
If your input exceeds 200K tokens, we recommend multi-GPU inference with the vLLM backend for better performance.
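For multi-GPU inference, the vLLM engine shards the model via tensor parallelism. A minimal sketch of the relevant settings, using the `vllm.AsyncEngineArgs` parameter names that `openai_api_server.py` in this folder already relies on; the concrete values below are illustrative, not tuned recommendations:

```python
# Illustrative multi-GPU settings for the vLLM backend. The keys mirror
# vllm.AsyncEngineArgs as used elsewhere in this repo; values are examples only.
engine_kwargs = dict(
    model="THUDM/glm-4-9b-chat-1m",  # the 1M-context checkpoint
    tensor_parallel_size=4,          # shard the weights across 4 GPUs
    max_model_len=262144,            # headroom for a 200K-token input
    gpu_memory_utilization=0.9,
)
print(engine_kwargs["tensor_parallel_size"])
```

These keyword arguments would be passed to `AsyncEngineArgs(**engine_kwargs)` before building the engine with `AsyncLLMEngine.from_engine_args`.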
#### GLM-4V-9B
| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
|-------|------------|------------|------------------|----------------------|
| BF16 | 28131MiB | 0.1016s | 33.4660 tokens/s | Input length is 1000 |
| BF16 | 33043MiB | 0.7935s | 39.2444 tokens/s | Input length is 8000 |

| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
|-------|------------|------------|------------------|----------------------|
| Int4 | 10267MiB | 0.1685s | 28.7101 tokens/s | Input length is 1000 |
| Int4 | 14105MiB | 0.8629s | 24.2370 tokens/s | Input length is 8000 |
### Minimum hardware requirements
To run the official baseline code (transformers backend) you need:
+ Python >= 3.10
+ At least 32 GB of RAM
To run all of the code in this folder, you also need:
+ A Linux operating system (Debian-based distributions work best)
+ A GPU with more than 8 GB of memory that supports CUDA or ROCm and `BF16` inference (A100-class or newer GPUs; V100, 20-series, and older architectures are not supported)
Install dependencies:
```shell
pip install -r requirements.txt
```
## Basic function calls
**Unless otherwise specified, the demos in this folder do not support advanced usage such as Function Call and All Tools.**
### Using the transformers backend
+ Chat with the GLM-4-9B model from the command line.
```shell
python trans_cli_demo.py # GLM-4-9B-Chat
python trans_cli_vision_demo.py # GLM-4V-9B
```
+ Chat with the GLM-4-9B-Chat model through a Gradio web UI.
```shell
python trans_web_demo.py
```
+ Run batch inference.
```shell
python cli_batch_request_demo.py
```
### Using the vLLM backend
+ Chat with the GLM-4-9B-Chat model from the command line.
```shell
python vllm_cli_demo.py
```
+ Build your own server and talk to the GLM-4-9B-Chat model using `OpenAI API`-style requests. This demo supports Function Call and All Tools.
Start the server:
```shell
python openai_api_server.py
```
Send a client request:
```shell
python openai_api_request.py
```
## Stress test
Use this script to measure the model's generation speed with the transformers backend on your own device:
```shell
python trans_stress_test.py
```
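The stress-test script accepts a few flags (defined in `trans_stress_test.py`'s argument parser: `--token_len`, `--n`, `--num_gpu`). For example, to test 8000-token inputs over 5 iterations on one GPU:

```shell
python trans_stress_test.py --token_len 8000 --n 5 --num_gpu 1
```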
"""
This script creates an OpenAI-style request demo for the GLM-4-9B model: it simply uses the OpenAI API client to talk to a locally served model.
"""
from openai import OpenAI
base_url = "http://127.0.0.1:8000/v1/"
client = OpenAI(api_key="EMPTY", base_url=base_url)
def function_chat():
messages = [{"role": "user", "content": "What's the weather like in San Francisco, Tokyo, and Paris?"}]
tools = [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather in a given location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA",
},
"unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
},
"required": ["location"],
},
},
}
]
    # All Tools capability: image generation
    # messages = [{"role": "user", "content": "Draw me a picture of the sky"}]
    # tools = [{"type": "cogview"}]
    #
    # All Tools capability: web search
    # messages = [{"role": "user", "content": "What is the price of gold today?"}]
    # tools = [{"type": "simple_browser"}]
response = client.chat.completions.create(
model="glm-4",
messages=messages,
tools=tools,
tool_choice="auto", # use "auto" to let the model choose the tool automatically
# tool_choice={"type": "function", "function": {"name": "my_function"}},
)
    # The OpenAI client raises an APIStatusError on HTTP failures, so reaching
    # this point means the request succeeded and the response can be read directly.
    print(response.choices[0].message.content)
def simple_chat(use_stream=False):
messages = [
{
"role": "system",
            "content": "You are GLM-4. Please answer the user's questions warmly.",
},
{
"role": "user",
            "content": "Hello, please tell me a short story in vivid language."
}
]
response = client.chat.completions.create(
model="glm-4",
messages=messages,
stream=use_stream,
max_tokens=1024,
temperature=0.8,
presence_penalty=1.1,
top_p=0.8)
    # The OpenAI client raises on HTTP failures, so `response` is always valid here.
    if use_stream:
        for chunk in response:
            # the final streamed chunk may carry no delta content
            print(chunk.choices[0].delta.content or "", end="", flush=True)
        print()
    else:
        print(response.choices[0].message.content)
if __name__ == "__main__":
simple_chat()
function_chat()
import os
import time
from asyncio.log import logger
import uvicorn
import gc
import json
import torch
from vllm import SamplingParams, AsyncEngineArgs, AsyncLLMEngine
from fastapi import FastAPI, HTTPException, Response
from fastapi.middleware.cors import CORSMiddleware
from contextlib import asynccontextmanager
from typing import List, Literal, Optional, Union
from pydantic import BaseModel, Field
from transformers import AutoTokenizer, LogitsProcessor
from sse_starlette.sse import EventSourceResponse
EventSourceResponse.DEFAULT_PING_INTERVAL = 1000
MODEL_PATH = 'THUDM/glm-4-9b-chat'
MAX_MODEL_LENGTH = 8192
@asynccontextmanager
async def lifespan(app: FastAPI):
yield
if torch.cuda.is_available():
torch.cuda.empty_cache()
torch.cuda.ipc_collect()
app = FastAPI(lifespan=lifespan)
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
class ModelCard(BaseModel):
id: str
object: str = "model"
created: int = Field(default_factory=lambda: int(time.time()))
owned_by: str = "owner"
root: Optional[str] = None
parent: Optional[str] = None
permission: Optional[list] = None
class ModelList(BaseModel):
object: str = "list"
data: List[ModelCard] = []
class FunctionCallResponse(BaseModel):
name: Optional[str] = None
arguments: Optional[str] = None
class ChatMessage(BaseModel):
role: Literal["user", "assistant", "system", "tool"]
content: str = None
name: Optional[str] = None
function_call: Optional[FunctionCallResponse] = None
class DeltaMessage(BaseModel):
role: Optional[Literal["user", "assistant", "system"]] = None
content: Optional[str] = None
function_call: Optional[FunctionCallResponse] = None
class EmbeddingRequest(BaseModel):
input: Union[List[str], str]
model: str
class CompletionUsage(BaseModel):
prompt_tokens: int
completion_tokens: int
total_tokens: int
class EmbeddingResponse(BaseModel):
data: list
model: str
object: str
usage: CompletionUsage
class UsageInfo(BaseModel):
prompt_tokens: int = 0
total_tokens: int = 0
completion_tokens: Optional[int] = 0
class ChatCompletionRequest(BaseModel):
model: str
messages: List[ChatMessage]
temperature: Optional[float] = 0.8
top_p: Optional[float] = 0.8
max_tokens: Optional[int] = None
stream: Optional[bool] = False
tools: Optional[Union[dict, List[dict]]] = None
    tool_choice: Optional[Union[str, dict]] = "none"  # lowercase "none" matches the check in process_messages
repetition_penalty: Optional[float] = 1.1
class ChatCompletionResponseChoice(BaseModel):
index: int
message: ChatMessage
finish_reason: Literal["stop", "length", "function_call"]
class ChatCompletionResponseStreamChoice(BaseModel):
delta: DeltaMessage
finish_reason: Optional[Literal["stop", "length", "function_call"]]
index: int
class ChatCompletionResponse(BaseModel):
model: str
id: str
object: Literal["chat.completion", "chat.completion.chunk"]
choices: List[Union[ChatCompletionResponseChoice, ChatCompletionResponseStreamChoice]]
created: Optional[int] = Field(default_factory=lambda: int(time.time()))
usage: Optional[UsageInfo] = None
class InvalidScoreLogitsProcessor(LogitsProcessor):
def __call__(
self, input_ids: torch.LongTensor, scores: torch.FloatTensor
) -> torch.FloatTensor:
if torch.isnan(scores).any() or torch.isinf(scores).any():
scores.zero_()
scores[..., 5] = 5e4
return scores
def process_response(output: str, use_tool: bool = False) -> Union[str, dict]:
    # A tool call is emitted as "<function name>\n<arguments>"; plain replies
    # have no metadata line before the content.
    import ast  # local import so this helper stays self-contained
    content = ""
    for response in output.split("<|assistant|>"):
        if "\n" in response:
            metadata, content = response.split("\n", maxsplit=1)
        else:
            metadata, content = "", response
        if not metadata.strip():
            content = content.strip()
        else:
            if use_tool:
                # literal_eval is a safer substitute for eval() on model output
                parameters = ast.literal_eval(content.strip())
                content = {
                    "name": metadata.strip(),
                    "arguments": json.dumps(parameters, ensure_ascii=False)
                }
            else:
                content = {
                    "name": metadata.strip(),
                    "content": content
                }
    return content
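For intuition, the parsing rule above can be exercised in isolation. This sketch mirrors the metadata/content split on a synthetic tool-call output (using `json.loads` instead of `eval` for the illustration; `parse_tool_output` is a hypothetical helper, not part of this server):

```python
import json

def parse_tool_output(output: str) -> dict:
    # A tool call looks like "<function name>\n<JSON arguments>": the first
    # line is the metadata (the function name), the rest is the payload.
    metadata, args = output.split("\n", maxsplit=1)
    return {
        "name": metadata.strip(),
        "arguments": json.dumps(json.loads(args.strip()), ensure_ascii=False),
    }

print(parse_tool_output('get_current_weather\n{"location": "Tokyo"}'))
# → {'name': 'get_current_weather', 'arguments': '{"location": "Tokyo"}'}
```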
@torch.inference_mode()
async def generate_stream_glm4(params):
messages = params["messages"]
tools = params["tools"]
tool_choice = params["tool_choice"]
temperature = float(params.get("temperature", 1.0))
repetition_penalty = float(params.get("repetition_penalty", 1.0))
top_p = float(params.get("top_p", 1.0))
max_new_tokens = int(params.get("max_tokens", 8192))
messages = process_messages(messages, tools=tools, tool_choice=tool_choice)
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
params_dict = {
"n": 1,
"best_of": 1,
"presence_penalty": 1.0,
"frequency_penalty": 0.0,
"temperature": temperature,
"top_p": top_p,
"top_k": -1,
"repetition_penalty": repetition_penalty,
"use_beam_search": False,
"length_penalty": 1,
"early_stopping": False,
"stop_token_ids": [151329, 151336, 151338],
"ignore_eos": False,
"max_tokens": max_new_tokens,
"logprobs": None,
"prompt_logprobs": None,
"skip_special_tokens": True,
}
sampling_params = SamplingParams(**params_dict)
async for output in engine.generate(inputs=inputs, sampling_params=sampling_params, request_id="glm-4-9b"):
output_len = len(output.outputs[0].token_ids)
input_len = len(output.prompt_token_ids)
ret = {
"text": output.outputs[0].text,
"usage": {
"prompt_tokens": input_len,
"completion_tokens": output_len,
"total_tokens": output_len + input_len
},
"finish_reason": output.outputs[0].finish_reason,
}
yield ret
gc.collect()
torch.cuda.empty_cache()
def process_messages(messages, tools=None, tool_choice="none"):
_messages = messages
messages = []
msg_has_sys = False
def filter_tools(tool_choice, tools):
function_name = tool_choice.get('function', {}).get('name', None)
if not function_name:
return []
filtered_tools = [
tool for tool in tools
if tool.get('function', {}).get('name') == function_name
]
return filtered_tools
if tool_choice != "none":
if isinstance(tool_choice, dict):
tools = filter_tools(tool_choice, tools)
if tools:
messages.append(
{
"role": "system",
"content": None,
"tools": tools
}
)
msg_has_sys = True
# add to metadata
if isinstance(tool_choice, dict) and tools:
messages.append(
{
"role": "assistant",
"metadata": tool_choice["function"]["name"],
"content": ""
}
)
for m in _messages:
role, content, func_call = m.role, m.content, m.function_call
if role == "function":
messages.append(
{
"role": "observation",
"content": content
}
)
elif role == "assistant" and func_call is not None:
for response in content.split("<|assistant|>"):
if "\n" in response:
metadata, sub_content = response.split("\n", maxsplit=1)
else:
metadata, sub_content = "", response
messages.append(
{
"role": role,
"metadata": metadata,
"content": sub_content.strip()
}
)
else:
if role == "system" and msg_has_sys:
msg_has_sys = False
continue
messages.append({"role": role, "content": content})
return messages
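The `filter_tools` helper inside `process_messages` can be seen standalone: when `tool_choice` pins a specific function, every other tool is dropped before the system message is built. A self-contained copy of that logic:

```python
def filter_tools(tool_choice: dict, tools: list) -> list:
    # Keep only the tool whose function name matches tool_choice; an empty
    # or nameless tool_choice selects nothing.
    function_name = tool_choice.get("function", {}).get("name")
    if not function_name:
        return []
    return [t for t in tools if t.get("function", {}).get("name") == function_name]

tools = [
    {"type": "function", "function": {"name": "get_current_weather"}},
    {"type": "function", "function": {"name": "get_time"}},
]
print(filter_tools({"function": {"name": "get_time"}}, tools))
# → [{'type': 'function', 'function': {'name': 'get_time'}}]
```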
@app.get("/health")
async def health() -> Response:
"""Health check."""
return Response(status_code=200)
@app.get("/v1/models", response_model=ModelList)
async def list_models():
model_card = ModelCard(id="glm-4")
return ModelList(data=[model_card])
@app.post("/v1/chat/completions", response_model=ChatCompletionResponse)
async def create_chat_completion(request: ChatCompletionRequest):
if len(request.messages) < 1 or request.messages[-1].role == "assistant":
raise HTTPException(status_code=400, detail="Invalid request")
gen_params = dict(
messages=request.messages,
temperature=request.temperature,
top_p=request.top_p,
max_tokens=request.max_tokens or 1024,
echo=False,
stream=request.stream,
repetition_penalty=request.repetition_penalty,
tools=request.tools,
tool_choice=request.tool_choice,
)
logger.debug(f"==== request ====\n{gen_params}")
if request.stream:
predict_stream_generator = predict_stream(request.model, gen_params)
output = await anext(predict_stream_generator)
if output:
return EventSourceResponse(predict_stream_generator, media_type="text/event-stream")
logger.debug(f"First result output:\n{output}")
function_call = None
if output and request.tools:
try:
function_call = process_response(output, use_tool=True)
        except Exception:
            logger.warning("Failed to parse tool call")
# CallFunction
if isinstance(function_call, dict):
function_call = FunctionCallResponse(**function_call)
tool_response = ""
if not gen_params.get("messages"):
gen_params["messages"] = []
gen_params["messages"].append(ChatMessage(role="assistant", content=output))
gen_params["messages"].append(ChatMessage(role="tool", name=function_call.name, content=tool_response))
generate = predict(request.model, gen_params)
return EventSourceResponse(generate, media_type="text/event-stream")
else:
generate = parse_output_text(request.model, output)
return EventSourceResponse(generate, media_type="text/event-stream")
response = ""
async for response in generate_stream_glm4(gen_params):
pass
if response["text"].startswith("\n"):
response["text"] = response["text"][1:]
response["text"] = response["text"].strip()
usage = UsageInfo()
function_call, finish_reason = None, "stop"
if request.tools:
try:
function_call = process_response(response["text"], use_tool=True)
        except Exception:
            logger.warning(
                "Failed to parse tool call; the response may not be a function call (e.g. CogView drawing) or may already be a final answer.")
if isinstance(function_call, dict):
finish_reason = "function_call"
function_call = FunctionCallResponse(**function_call)
message = ChatMessage(
role="assistant",
content=response["text"],
function_call=function_call if isinstance(function_call, FunctionCallResponse) else None,
)
logger.debug(f"==== message ====\n{message}")
choice_data = ChatCompletionResponseChoice(
index=0,
message=message,
finish_reason=finish_reason,
)
task_usage = UsageInfo.model_validate(response["usage"])
for usage_key, usage_value in task_usage.model_dump().items():
setattr(usage, usage_key, getattr(usage, usage_key) + usage_value)
return ChatCompletionResponse(
model=request.model,
id="", # for open_source model, id is empty
choices=[choice_data],
object="chat.completion",
usage=usage
)
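The usage accumulation at the end of the non-streaming path simply sums token counters field by field. With plain dicts, the same idea (a hypothetical `add_usage` helper, not part of the server) looks like:

```python
def add_usage(total: dict, task: dict) -> dict:
    # Field-wise sum of token counters, mirroring the UsageInfo loop above
    return {key: total.get(key, 0) + value for key, value in task.items()}

usage = {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}
usage = add_usage(usage, {"prompt_tokens": 12, "completion_tokens": 30, "total_tokens": 42})
print(usage)
# → {'prompt_tokens': 12, 'completion_tokens': 30, 'total_tokens': 42}
```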
async def predict(model_id: str, params: dict):
choice_data = ChatCompletionResponseStreamChoice(
index=0,
delta=DeltaMessage(role="assistant"),
finish_reason=None
)
chunk = ChatCompletionResponse(model=model_id, id="", choices=[choice_data], object="chat.completion.chunk")
yield "{}".format(chunk.model_dump_json(exclude_unset=True))
previous_text = ""
async for new_response in generate_stream_glm4(params):
decoded_unicode = new_response["text"]
delta_text = decoded_unicode[len(previous_text):]
previous_text = decoded_unicode
finish_reason = new_response["finish_reason"]
if len(delta_text) == 0 and finish_reason != "function_call":
continue
function_call = None
if finish_reason == "function_call":
try:
function_call = process_response(decoded_unicode, use_tool=True)
            except Exception:
                logger.warning(
                    "Failed to parse tool call; the response may not be a tool call or may already be a final answer.")
if isinstance(function_call, dict):
function_call = FunctionCallResponse(**function_call)
delta = DeltaMessage(
content=delta_text,
role="assistant",
function_call=function_call if isinstance(function_call, FunctionCallResponse) else None,
)
choice_data = ChatCompletionResponseStreamChoice(
index=0,
delta=delta,
finish_reason=finish_reason
)
chunk = ChatCompletionResponse(
model=model_id,
id="",
choices=[choice_data],
object="chat.completion.chunk"
)
yield "{}".format(chunk.model_dump_json(exclude_unset=True))
choice_data = ChatCompletionResponseStreamChoice(
index=0,
delta=DeltaMessage(),
finish_reason="stop"
)
chunk = ChatCompletionResponse(
model=model_id,
id="",
choices=[choice_data],
object="chat.completion.chunk"
)
yield "{}".format(chunk.model_dump_json(exclude_unset=True))
yield '[DONE]'
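Each vLLM iteration yields the full text generated so far, so the streamed delta is just the new suffix relative to the previous snapshot. A minimal illustration of that diffing step:

```python
def deltas(snapshots):
    # Convert cumulative text snapshots into the per-chunk deltas sent via SSE
    previous = ""
    out = []
    for snapshot in snapshots:
        out.append(snapshot[len(previous):])
        previous = snapshot
    return out

print(deltas(["He", "Hello", "Hello, world"]))
# → ['He', 'llo', ', world']
```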
async def predict_stream(model_id, gen_params):
output = ""
is_function_call = False
has_send_first_chunk = False
async for new_response in generate_stream_glm4(gen_params):
decoded_unicode = new_response["text"]
delta_text = decoded_unicode[len(output):]
output = decoded_unicode
        if not is_function_call and len(output) > 7:
            # crude heuristic: this demo's tool names all start with "get_"
            is_function_call = output and 'get_' in output
if is_function_call:
continue
finish_reason = new_response["finish_reason"]
if not has_send_first_chunk:
message = DeltaMessage(
content="",
role="assistant",
function_call=None,
)
choice_data = ChatCompletionResponseStreamChoice(
index=0,
delta=message,
finish_reason=finish_reason
)
chunk = ChatCompletionResponse(
model=model_id,
id="",
choices=[choice_data],
created=int(time.time()),
object="chat.completion.chunk"
)
yield "{}".format(chunk.model_dump_json(exclude_unset=True))
send_msg = delta_text if has_send_first_chunk else output
has_send_first_chunk = True
message = DeltaMessage(
content=send_msg,
role="assistant",
function_call=None,
)
choice_data = ChatCompletionResponseStreamChoice(
index=0,
delta=message,
finish_reason=finish_reason
)
chunk = ChatCompletionResponse(
model=model_id,
id="",
choices=[choice_data],
created=int(time.time()),
object="chat.completion.chunk"
)
yield "{}".format(chunk.model_dump_json(exclude_unset=True))
if is_function_call:
yield output
else:
yield '[DONE]'
async def parse_output_text(model_id: str, value: str):
choice_data = ChatCompletionResponseStreamChoice(
index=0,
delta=DeltaMessage(role="assistant", content=value),
finish_reason=None
)
chunk = ChatCompletionResponse(model=model_id, id="", choices=[choice_data], object="chat.completion.chunk")
yield "{}".format(chunk.model_dump_json(exclude_unset=True))
choice_data = ChatCompletionResponseStreamChoice(
index=0,
delta=DeltaMessage(),
finish_reason="stop"
)
chunk = ChatCompletionResponse(model=model_id, id="", choices=[choice_data], object="chat.completion.chunk")
yield "{}".format(chunk.model_dump_json(exclude_unset=True))
yield '[DONE]'
if __name__ == "__main__":
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
engine_args = AsyncEngineArgs(
model=MODEL_PATH,
tokenizer=MODEL_PATH,
tensor_parallel_size=1,
dtype="bfloat16",
trust_remote_code=True,
gpu_memory_utilization=0.9,
enforce_eager=True,
worker_use_ray=True,
engine_use_ray=False,
disable_log_requests=True,
max_model_len=MAX_MODEL_LENGTH,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
uvicorn.run(app, host='0.0.0.0', port=8000, workers=1)
"""
Here is an example of batched requests with GLM-4-9B: build the conversation
format yourself, then call the batch function to issue the requests in one go.
Note that memory consumption in this demo is significantly higher.
"""
from typing import Optional, Union
from transformers import AutoModel, AutoTokenizer, LogitsProcessorList
MODEL_PATH = 'THUDM/glm-4-9b-chat'
tokenizer = AutoTokenizer.from_pretrained(
MODEL_PATH,
trust_remote_code=True,
encode_special_tokens=True)
model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True, device_map="auto").eval()
def process_model_outputs(inputs, outputs, tokenizer):
responses = []
for input_ids, output_ids in zip(inputs.input_ids, outputs):
response = tokenizer.decode(output_ids[len(input_ids):], skip_special_tokens=True).strip()
responses.append(response)
return responses
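`model.generate` echoes the prompt followed by the continuation for each row, so the decode step above slices off the prompt ids first. With plain id lists (hypothetical `strip_prompt` helper for illustration):

```python
def strip_prompt(input_ids: list, output_ids: list) -> list:
    # generate() returns prompt + continuation; keep only the new ids
    return output_ids[len(input_ids):]

prompt_ids = [101, 102, 103]
full_output = [101, 102, 103, 7, 8, 9]
print(strip_prompt(prompt_ids, full_output))  # → [7, 8, 9]
```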
def batch(
model,
tokenizer,
messages: Union[str, list[str]],
max_input_tokens: int = 8192,
max_new_tokens: int = 8192,
num_beams: int = 1,
do_sample: bool = True,
top_p: float = 0.8,
temperature: float = 0.8,
        logits_processor: Optional[LogitsProcessorList] = None,  # avoid a shared mutable default
):
messages = [messages] if isinstance(messages, str) else messages
batched_inputs = tokenizer(messages, return_tensors="pt", padding="max_length", truncation=True,
max_length=max_input_tokens).to(model.device)
gen_kwargs = {
"max_new_tokens": max_new_tokens,
"num_beams": num_beams,
"do_sample": do_sample,
"top_p": top_p,
"temperature": temperature,
"logits_processor": logits_processor,
"eos_token_id": model.config.eos_token_id
}
batched_outputs = model.generate(**batched_inputs, **gen_kwargs)
batched_response = process_model_outputs(batched_inputs, batched_outputs, tokenizer)
return batched_response
if __name__ == "__main__":
batch_message = [
[
            {"role": "user", "content": "Why couldn't I go to my mom and dad's wedding?"},
            {"role": "assistant", "content": "Because you had not been born yet when they got married."},
            {"role": "user", "content": "What did I just ask?"}
],
[
            {"role": "user", "content": "Hello, who are you?"}
]
]
batch_inputs = []
max_input_tokens = 1024
for i, messages in enumerate(batch_message):
new_batch_input = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
        # note: len() counts characters of the templated string here, which is
        # only a rough upper-bound proxy for the token count
        max_input_tokens = max(max_input_tokens, len(new_batch_input))
batch_inputs.append(new_batch_input)
gen_kwargs = {
"max_input_tokens": max_input_tokens,
"max_new_tokens": 8192,
"do_sample": True,
"top_p": 0.8,
"temperature": 0.8,
"num_beams": 1,
}
batch_responses = batch(model, tokenizer, batch_inputs, **gen_kwargs)
for response in batch_responses:
print("=" * 10)
print(response)
import argparse
import time
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer, BitsAndBytesConfig
import torch
from threading import Thread
MODEL_PATH = 'THUDM/glm-4-9b-chat'
def stress_test(token_len, n, num_gpu):
device = torch.device(f"cuda:{num_gpu - 1}" if torch.cuda.is_available() and num_gpu > 0 else "cpu")
tokenizer = AutoTokenizer.from_pretrained(
MODEL_PATH,
trust_remote_code=True,
padding_side="left"
)
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
trust_remote_code=True,
# quantization_config=BitsAndBytesConfig(load_in_4bit=True),
# low_cpu_mem_usage=True,
torch_dtype=torch.bfloat16
).to(device).eval()
times = []
decode_times = []
print("Warming up...")
vocab_size = tokenizer.vocab_size
warmup_token_len = 20
random_token_ids = torch.randint(3, vocab_size - 200, (warmup_token_len - 5,), dtype=torch.long)
start_tokens = [151331, 151333, 151336, 198]
end_tokens = [151337]
input_ids = torch.tensor(start_tokens + random_token_ids.tolist() + end_tokens, dtype=torch.long).unsqueeze(0).to(
device)
    attention_mask = torch.ones_like(input_ids, dtype=torch.long).to(device)
    position_ids = torch.arange(len(input_ids[0]), dtype=torch.long).unsqueeze(0).to(device)
warmup_inputs = {
'input_ids': input_ids,
'attention_mask': attention_mask,
'position_ids': position_ids
}
with torch.no_grad():
_ = model.generate(
input_ids=warmup_inputs['input_ids'],
attention_mask=warmup_inputs['attention_mask'],
max_new_tokens=2048,
do_sample=False,
repetition_penalty=1.0,
eos_token_id=[151329, 151336, 151338]
)
print("Warming up complete. Starting stress test...")
for i in range(n):
random_token_ids = torch.randint(3, vocab_size - 200, (token_len - 5,), dtype=torch.long)
input_ids = torch.tensor(start_tokens + random_token_ids.tolist() + end_tokens, dtype=torch.long).unsqueeze(
0).to(device)
        attention_mask = torch.ones_like(input_ids, dtype=torch.long).to(device)
        position_ids = torch.arange(len(input_ids[0]), dtype=torch.long).unsqueeze(0).to(device)
test_inputs = {
'input_ids': input_ids,
'attention_mask': attention_mask,
'position_ids': position_ids
}
streamer = TextIteratorStreamer(
tokenizer=tokenizer,
timeout=36000,
skip_prompt=True,
skip_special_tokens=True
)
generate_kwargs = {
"input_ids": test_inputs['input_ids'],
"attention_mask": test_inputs['attention_mask'],
"max_new_tokens": 512,
"do_sample": False,
"repetition_penalty": 1.0,
"eos_token_id": [151329, 151336, 151338],
"streamer": streamer
}
start_time = time.time()
t = Thread(target=model.generate, kwargs=generate_kwargs)
t.start()
first_token_time = None
all_token_times = []
for token in streamer:
current_time = time.time()
if first_token_time is None:
first_token_time = current_time
times.append(first_token_time - start_time)
all_token_times.append(current_time)
t.join()
end_time = time.time()
        avg_decode_speed = len(all_token_times) / (end_time - first_token_time) if all_token_times else 0
        decode_times.append(avg_decode_speed)
        print(
            f"Iteration {i + 1}/{n} - Prefilling Time: {times[-1]:.4f} seconds - Decode Speed: {avg_decode_speed:.4f} tokens/second")
torch.cuda.empty_cache()
avg_first_token_time = sum(times) / n
    avg_decode_speed = sum(decode_times) / n
    print(f"\nAverage First Token Time over {n} iterations: {avg_first_token_time:.4f} seconds")
    print(f"Average Decode Speed over {n} iterations: {avg_decode_speed:.4f} tokens/second")
    return times, avg_first_token_time, decode_times, avg_decode_speed
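The decode-speed figure reported above is the number of tokens emitted after the first token, divided by the elapsed decode time. Factored out (hypothetical `decode_speed` helper for illustration):

```python
def decode_speed(tokens_emitted: int, first_token_time: float, end_time: float) -> float:
    # tokens per second during the decode phase (after the first token arrives)
    return tokens_emitted / (end_time - first_token_time)

print(decode_speed(120, first_token_time=1.5, end_time=5.5))  # → 30.0
```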
def main():
parser = argparse.ArgumentParser(description="Stress test for model inference")
parser.add_argument('--token_len', type=int, default=1000, help='Number of tokens for each test')
parser.add_argument('--n', type=int, default=3, help='Number of iterations for the stress test')
parser.add_argument('--num_gpu', type=int, default=1, help='Number of GPUs to use for inference')
args = parser.parse_args()
token_len = args.token_len
n = args.n
num_gpu = args.num_gpu
stress_test(token_len, n, num_gpu)
if __name__ == "__main__":
main()
"""
This script creates a CLI demo with a vLLM backend for the GLM-4-9B model,
allowing users to interact with the model through a command-line interface.
Usage:
- Run the script to start the CLI demo.
- Interact with the model by typing questions and receiving responses.
Note: The script includes a modification to handle markdown to plain text conversion,
ensuring that the CLI interface displays formatted text correctly.
"""
import time
import asyncio
import argparse
from transformers import AutoTokenizer
from vllm import SamplingParams, AsyncEngineArgs, AsyncLLMEngine
from typing import List, Dict
# add model path
parser = argparse.ArgumentParser()
parser.add_argument('--model_name_or_path', default='THUDM/glm-4-9b-chat',
                    help='local path or Hugging Face repo id of the chat model')
args = parser.parse_args()
MODEL_PATH = args.model_name_or_path
def load_model_and_tokenizer(model_dir: str):
engine_args = AsyncEngineArgs(
model=model_dir,
tokenizer=model_dir,
tensor_parallel_size=1,
dtype="bfloat16",
trust_remote_code=True,
gpu_memory_utilization=0.3,
enforce_eager=True,
worker_use_ray=True,
engine_use_ray=False,
disable_log_requests=True
        # If you run into OOM, consider enabling the options below
# enable_chunked_prefill=True,
# max_num_batched_tokens=8192
)
tokenizer = AutoTokenizer.from_pretrained(
model_dir,
trust_remote_code=True,
encode_special_tokens=True
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
return engine, tokenizer
engine, tokenizer = load_model_and_tokenizer(MODEL_PATH)
async def vllm_gen(messages: List[Dict[str, str]], top_p: float, temperature: float, max_dec_len: int):
inputs = tokenizer.apply_chat_template(
messages,
add_generation_prompt=True,
tokenize=False
)
params_dict = {
"n": 1,
"best_of": 1,
"presence_penalty": 1.0,
"frequency_penalty": 0.0,
"temperature": temperature,
"top_p": top_p,
"top_k": -1,
"use_beam_search": False,
"length_penalty": 1,
"early_stopping": False,
"stop_token_ids": [151329, 151336, 151338],
"ignore_eos": False,
"max_tokens": max_dec_len,
"logprobs": None,
"prompt_logprobs": None,
"skip_special_tokens": True,
}
sampling_params = SamplingParams(**params_dict)
async for output in engine.generate(inputs=inputs, sampling_params=sampling_params, request_id=f"{time.time()}"):
yield output.outputs[0].text
async def chat():
history = []
max_length = 8192
top_p = 0.8
temperature = 0.6
print("Welcome to the GLM-4-9B CLI chat. Type your messages below.")
while True:
user_input = input("\nYou: ")
if user_input.lower() in ["exit", "quit"]:
break
history.append([user_input, ""])
messages = []
for idx, (user_msg, model_msg) in enumerate(history):
if idx == len(history) - 1 and not model_msg:
messages.append({"role": "user", "content": user_msg})
break
if user_msg:
messages.append({"role": "user", "content": user_msg})
if model_msg:
messages.append({"role": "assistant", "content": model_msg})
print("\nGLM-4: ", end="")
current_length = 0
output = ""
async for output in vllm_gen(messages, top_p, temperature, max_length):
print(output[current_length:], end="", flush=True)
current_length = len(output)
history[-1][1] = output
if __name__ == "__main__":
asyncio.run(chat())
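The history-to-messages conversion inside `chat()` can be factored out and checked on its own; the empty assistant slot on the last pair marks the turn still awaiting a reply (hypothetical `history_to_messages` helper mirroring the loop above):

```python
def history_to_messages(history):
    # history is a list of [user_msg, model_msg] pairs; an empty model_msg on
    # the final pair is the pending turn, exactly as in chat()
    messages = []
    for idx, (user_msg, model_msg) in enumerate(history):
        if idx == len(history) - 1 and not model_msg:
            messages.append({"role": "user", "content": user_msg})
            break
        if user_msg:
            messages.append({"role": "user", "content": user_msg})
        if model_msg:
            messages.append({"role": "assistant", "content": model_msg})
    return messages

print(history_to_messages([["Hi", "Hello!"], ["Tell me a joke", ""]]))
# → [{'role': 'user', 'content': 'Hi'}, {'role': 'assistant', 'content': 'Hello!'}, {'role': 'user', 'content': 'Tell me a joke'}]
```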