Commit 599cfae1 authored by Rayyyyy's avatar Rayyyyy

Delete some codes about vllm

parent 7f9c28a1
# Basic Demo
Read this in [English](README_en.md)
In this demo you will learn how to use the GLM-4-9B open-source model for basic tasks.
Please follow the steps in this document carefully to avoid unnecessary errors.
## Device and Dependency Check
### Inference Test Data
**All data in this document were measured in the following hardware environment. Actual requirements and peak GPU memory usage will differ slightly; refer to your own environment.**
Test hardware:
+ OS: Ubuntu 22.04
+ Memory: 512GB
+ Python: 3.12.3
+ CUDA Version: 12.3
+ GPU Driver: 535.104.05
+ GPU: NVIDIA A100-SXM4-80GB * 8
Inference stress-test results are as follows:
**All tests were run on a single GPU, and all GPU memory figures are approximate peak values.**
#### GLM-4-9B-Chat
| Dtype | GPU Memory | Prefilling / First Token | Decode Speed | Remarks |
|------|----------|-----------------|------------------|--------------|
| BF16 | 19047MiB | 0.1554s | 27.8193 tokens/s | Input length is 1000 |
| BF16 | 20629MiB | 0.8199s | 31.8613 tokens/s | Input length is 8000 |
| BF16 | 27779MiB | 4.3554s | 14.4108 tokens/s | Input length is 32000 |
| BF16 | 57379MiB | 38.1467s | 3.4205 tokens/s | Input length is 128000 |

| Dtype | GPU Memory | Prefilling / First Token | Decode Speed | Remarks |
|------|----------|-----------------|------------------|-------------|
| Int4 | 8251MiB | 0.1667s | 23.3903 tokens/s | Input length is 1000 |
| Int4 | 9613MiB | 0.8629s | 23.4248 tokens/s | Input length is 8000 |
| Int4 | 16065MiB | 4.3906s | 14.6553 tokens/s | Input length is 32000 |
#### GLM-4-9B-Chat-1M
| Dtype | GPU Memory | Prefilling / First Token | Decode Speed | Remarks |
|------|----------|-----------------|------------------|--------------|
| BF16 | 74497MiB | 98.4930s | 2.3653 tokens/s | Input length is 200000 |

If your input exceeds 200K tokens, we recommend multi-GPU inference with the vLLM backend for better performance.
#### GLM-4V-9B
| Dtype | GPU Memory | Prefilling / First Token | Decode Speed | Remarks |
|------|----------|-----------------|------------------|------------|
| BF16 | 28131MiB | 0.1016s | 33.4660 tokens/s | Input length is 1000 |
| BF16 | 33043MiB | 0.7935s | 39.2444 tokens/s | Input length is 8000 |

| Dtype | GPU Memory | Prefilling / First Token | Decode Speed | Remarks |
|------|----------|-----------------|------------------|------------|
| Int4 | 10267MiB | 0.1685s | 28.7101 tokens/s | Input length is 1000 |
| Int4 | 14105MiB | 0.8629s | 24.2370 tokens/s | Input length is 8000 |
### Minimum Hardware Requirements
To run the official baseline code (transformers backend) you need:
+ Python >= 3.10
+ At least 32 GB of RAM
To run all of the code in this folder, you also need:
+ A Linux operating system (Debian-based distributions work best)
+ A GPU with more than 8 GB of memory that supports CUDA or ROCm and `BF16` inference (A100-class or newer GPUs; V100, 20-series, and older architectures are not supported)
Install dependencies:
```shell
pip install -r requirements.txt
```
## Basic Function Calls
**Unless otherwise noted, the demos in this folder do not support advanced usage such as Function Call and All Tools.**
### Using the transformers Backend
+ Chat with the GLM-4-9B model from the command line.
```shell
python trans_cli_demo.py # GLM-4-9B-Chat
python trans_cli_vision_demo.py # GLM-4V-9B
```
+ Chat with the GLM-4-9B-Chat model through a Gradio web UI.
```shell
python trans_web_demo.py
```
+ Run batch inference.
```shell
python cli_batch_request_demo.py
```
### Using the vLLM Backend
+ Chat with the GLM-4-9B-Chat model from the command line.
```shell
python vllm_cli_demo.py
```
+ Build your own server and talk to the GLM-4-9B-Chat model using `OpenAI API`-style requests. This demo supports Function Call and All Tools.
Start the server:
```shell
python openai_api_server.py
```
Send a client request:
```shell
python openai_api_request.py
```
## Stress Test
Use this script to measure the model's generation speed with the transformers backend on your own device:
```shell
python trans_stress_test.py
```
# Basic Demo
In this demo, you will experience how to use the GLM-4-9B open source model to perform basic tasks.
Please follow the steps in the document strictly to avoid unnecessary errors.
## Device and dependency check
### Related inference test data
**The data in this document were measured in the following hardware environment. Actual requirements and peak GPU memory usage will differ slightly; refer to your own environment.**
Test hardware information:
+ OS: Ubuntu 22.04
+ Memory: 512GB
+ Python: 3.12.3
+ CUDA Version: 12.3
+ GPU Driver: 535.104.05
+ GPU: NVIDIA A100-SXM4-80GB * 8
Inference stress-test results are as follows:
**All tests were run on a single GPU, and all GPU memory figures are approximate peak values.**
#### GLM-4-9B-Chat
| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
|-------|------------|------------|------------------|------------------------|
| BF16 | 19047MiB | 0.1554s | 27.8193 tokens/s | Input length is 1000 |
| BF16 | 20629MiB | 0.8199s | 31.8613 tokens/s | Input length is 8000 |
| BF16 | 27779MiB | 4.3554s | 14.4108 tokens/s | Input length is 32000 |
| BF16 | 57379MiB | 38.1467s | 3.4205 tokens/s | Input length is 128000 |

| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
|-------|------------|------------|------------------|-----------------------|
| Int4 | 8251MiB | 0.1667s | 23.3903 tokens/s | Input length is 1000 |
| Int4 | 9613MiB | 0.8629s | 23.4248 tokens/s | Input length is 8000 |
| Int4 | 16065MiB | 4.3906s | 14.6553 tokens/s | Input length is 32000 |
#### GLM-4-9B-Chat-1M
| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
|-------|------------|------------|------------------|--------------|
| BF16 | 74497MiB | 98.4930s | 2.3653 tokens/s | Input length is 200000 |
If your input exceeds 200K tokens, we recommend multi-GPU inference with the vLLM backend for better performance.
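For multi-GPU inference, the vLLM engine shards the model via tensor parallelism. A minimal sketch of the relevant settings, using the `vllm.AsyncEngineArgs` parameter names that `openai_api_server.py` in this folder already relies on; the concrete values below are illustrative, not tuned recommendations:

```python
# Illustrative multi-GPU settings for the vLLM backend. The keys mirror
# vllm.AsyncEngineArgs as used elsewhere in this repo; values are examples only.
engine_kwargs = dict(
    model="THUDM/glm-4-9b-chat-1m",  # the 1M-context checkpoint
    tensor_parallel_size=4,          # shard the weights across 4 GPUs
    max_model_len=262144,            # headroom for a 200K-token input
    gpu_memory_utilization=0.9,
)
print(engine_kwargs["tensor_parallel_size"])
```

These keyword arguments would be passed to `AsyncEngineArgs(**engine_kwargs)` before building the engine with `AsyncLLMEngine.from_engine_args`.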
#### GLM-4V-9B
| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
|-------|------------|------------|------------------|----------------------|
| BF16 | 28131MiB | 0.1016s | 33.4660 tokens/s | Input length is 1000 |
| BF16 | 33043MiB | 0.7935s | 39.2444 tokens/s | Input length is 8000 |

| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
|-------|------------|------------|------------------|----------------------|
| Int4 | 10267MiB | 0.1685s | 28.7101 tokens/s | Input length is 1000 |
| Int4 | 14105MiB | 0.8629s | 24.2370 tokens/s | Input length is 8000 |
### Minimum hardware requirements
To run the official baseline code (transformers backend) you need:
+ Python >= 3.10
+ At least 32 GB of RAM
To run all of the code in this folder, you also need:
+ A Linux operating system (Debian-based distributions work best)
+ A GPU with more than 8 GB of memory that supports CUDA or ROCm and `BF16` inference (A100-class or newer GPUs; V100, 20-series, and older architectures are not supported)
Install dependencies:
```shell
pip install -r requirements.txt
```
## Basic function calls
**Unless otherwise specified, the demos in this folder do not support advanced usage such as Function Call and All Tools.**
### Using the transformers backend
+ Chat with the GLM-4-9B model from the command line.
```shell
python trans_cli_demo.py # GLM-4-9B-Chat
python trans_cli_vision_demo.py # GLM-4V-9B
```
+ Chat with the GLM-4-9B-Chat model through a Gradio web UI.
```shell
python trans_web_demo.py
```
+ Run batch inference.
```shell
python cli_batch_request_demo.py
```
### Using the vLLM backend
+ Chat with the GLM-4-9B-Chat model from the command line.
```shell
python vllm_cli_demo.py
```
+ Build your own server and talk to the GLM-4-9B-Chat model using `OpenAI API`-style requests. This demo supports Function Call and All Tools.
Start the server:
```shell
python openai_api_server.py
```
Send a client request:
```shell
python openai_api_request.py
```
## Stress test
Use this script to measure the model's generation speed with the transformers backend on your own device:
```shell
python trans_stress_test.py
```
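The stress-test script accepts a few flags (defined in `trans_stress_test.py`'s argument parser: `--token_len`, `--n`, `--num_gpu`). For example, to test 8000-token inputs over 5 iterations on one GPU:

```shell
python trans_stress_test.py --token_len 8000 --n 5 --num_gpu 1
```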
"""
This script creates an OpenAI-style request demo for the GLM-4-9B model: it simply uses the OpenAI API client to talk to a locally served model.
"""
from openai import OpenAI
base_url = "http://127.0.0.1:8000/v1/"
client = OpenAI(api_key="EMPTY", base_url=base_url)
def function_chat():
messages = [{"role": "user", "content": "What's the weather like in San Francisco, Tokyo, and Paris?"}]
tools = [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather in a given location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA",
},
"unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
},
"required": ["location"],
},
},
}
]
    # All Tools capability: image generation
    # messages = [{"role": "user", "content": "Draw me a picture of the sky"}]
    # tools = [{"type": "cogview"}]
    #
    # All Tools capability: web search
    # messages = [{"role": "user", "content": "What is the price of gold today?"}]
    # tools = [{"type": "simple_browser"}]
response = client.chat.completions.create(
model="glm-4",
messages=messages,
tools=tools,
tool_choice="auto", # use "auto" to let the model choose the tool automatically
# tool_choice={"type": "function", "function": {"name": "my_function"}},
)
    # The OpenAI client raises an APIStatusError on HTTP failures, so reaching
    # this point means the request succeeded and the response can be read directly.
    print(response.choices[0].message.content)
def simple_chat(use_stream=False):
messages = [
{
"role": "system",
            "content": "You are GLM-4. Please answer the user's questions warmly.",
},
{
"role": "user",
            "content": "Hello, please tell me a short story in vivid language."
}
]
response = client.chat.completions.create(
model="glm-4",
messages=messages,
stream=use_stream,
max_tokens=1024,
temperature=0.8,
presence_penalty=1.1,
top_p=0.8)
    # The OpenAI client raises on HTTP failures, so `response` is always valid here.
    if use_stream:
        for chunk in response:
            # the final streamed chunk may carry no delta content
            print(chunk.choices[0].delta.content or "", end="", flush=True)
        print()
    else:
        print(response.choices[0].message.content)
if __name__ == "__main__":
simple_chat()
function_chat()
import os
import time
from asyncio.log import logger
import uvicorn
import gc
import json
import torch
from vllm import SamplingParams, AsyncEngineArgs, AsyncLLMEngine
from fastapi import FastAPI, HTTPException, Response
from fastapi.middleware.cors import CORSMiddleware
from contextlib import asynccontextmanager
from typing import List, Literal, Optional, Union
from pydantic import BaseModel, Field
from transformers import AutoTokenizer, LogitsProcessor
from sse_starlette.sse import EventSourceResponse
EventSourceResponse.DEFAULT_PING_INTERVAL = 1000
MODEL_PATH = 'THUDM/glm-4-9b-chat'
MAX_MODEL_LENGTH = 8192
@asynccontextmanager
async def lifespan(app: FastAPI):
yield
if torch.cuda.is_available():
torch.cuda.empty_cache()
torch.cuda.ipc_collect()
app = FastAPI(lifespan=lifespan)
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
class ModelCard(BaseModel):
id: str
object: str = "model"
created: int = Field(default_factory=lambda: int(time.time()))
owned_by: str = "owner"
root: Optional[str] = None
parent: Optional[str] = None
permission: Optional[list] = None
class ModelList(BaseModel):
object: str = "list"
data: List[ModelCard] = []
class FunctionCallResponse(BaseModel):
name: Optional[str] = None
arguments: Optional[str] = None
class ChatMessage(BaseModel):
role: Literal["user", "assistant", "system", "tool"]
content: str = None
name: Optional[str] = None
function_call: Optional[FunctionCallResponse] = None
class DeltaMessage(BaseModel):
role: Optional[Literal["user", "assistant", "system"]] = None
content: Optional[str] = None
function_call: Optional[FunctionCallResponse] = None
class EmbeddingRequest(BaseModel):
input: Union[List[str], str]
model: str
class CompletionUsage(BaseModel):
prompt_tokens: int
completion_tokens: int
total_tokens: int
class EmbeddingResponse(BaseModel):
data: list
model: str
object: str
usage: CompletionUsage
class UsageInfo(BaseModel):
prompt_tokens: int = 0
total_tokens: int = 0
completion_tokens: Optional[int] = 0
class ChatCompletionRequest(BaseModel):
model: str
messages: List[ChatMessage]
temperature: Optional[float] = 0.8
top_p: Optional[float] = 0.8
max_tokens: Optional[int] = None
stream: Optional[bool] = False
tools: Optional[Union[dict, List[dict]]] = None
    tool_choice: Optional[Union[str, dict]] = "none"  # lowercase "none" matches the check in process_messages
repetition_penalty: Optional[float] = 1.1
class ChatCompletionResponseChoice(BaseModel):
index: int
message: ChatMessage
finish_reason: Literal["stop", "length", "function_call"]
class ChatCompletionResponseStreamChoice(BaseModel):
delta: DeltaMessage
finish_reason: Optional[Literal["stop", "length", "function_call"]]
index: int
class ChatCompletionResponse(BaseModel):
model: str
id: str
object: Literal["chat.completion", "chat.completion.chunk"]
choices: List[Union[ChatCompletionResponseChoice, ChatCompletionResponseStreamChoice]]
created: Optional[int] = Field(default_factory=lambda: int(time.time()))
usage: Optional[UsageInfo] = None
class InvalidScoreLogitsProcessor(LogitsProcessor):
def __call__(
self, input_ids: torch.LongTensor, scores: torch.FloatTensor
) -> torch.FloatTensor:
if torch.isnan(scores).any() or torch.isinf(scores).any():
scores.zero_()
scores[..., 5] = 5e4
return scores
def process_response(output: str, use_tool: bool = False) -> Union[str, dict]:
    # A tool call is emitted as "<function name>\n<arguments>"; plain replies
    # have no metadata line before the content.
    import ast  # local import so this helper stays self-contained
    content = ""
    for response in output.split("<|assistant|>"):
        if "\n" in response:
            metadata, content = response.split("\n", maxsplit=1)
        else:
            metadata, content = "", response
        if not metadata.strip():
            content = content.strip()
        else:
            if use_tool:
                # literal_eval is a safer substitute for eval() on model output
                parameters = ast.literal_eval(content.strip())
                content = {
                    "name": metadata.strip(),
                    "arguments": json.dumps(parameters, ensure_ascii=False)
                }
            else:
                content = {
                    "name": metadata.strip(),
                    "content": content
                }
    return content
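For intuition, the parsing rule above can be exercised in isolation. This sketch mirrors the metadata/content split on a synthetic tool-call output (using `json.loads` instead of `eval` for the illustration; `parse_tool_output` is a hypothetical helper, not part of this server):

```python
import json

def parse_tool_output(output: str) -> dict:
    # A tool call looks like "<function name>\n<JSON arguments>": the first
    # line is the metadata (the function name), the rest is the payload.
    metadata, args = output.split("\n", maxsplit=1)
    return {
        "name": metadata.strip(),
        "arguments": json.dumps(json.loads(args.strip()), ensure_ascii=False),
    }

print(parse_tool_output('get_current_weather\n{"location": "Tokyo"}'))
# → {'name': 'get_current_weather', 'arguments': '{"location": "Tokyo"}'}
```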
@torch.inference_mode()
async def generate_stream_glm4(params):
messages = params["messages"]
tools = params["tools"]
tool_choice = params["tool_choice"]
temperature = float(params.get("temperature", 1.0))
repetition_penalty = float(params.get("repetition_penalty", 1.0))
top_p = float(params.get("top_p", 1.0))
max_new_tokens = int(params.get("max_tokens", 8192))
messages = process_messages(messages, tools=tools, tool_choice=tool_choice)
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
params_dict = {
"n": 1,
"best_of": 1,
"presence_penalty": 1.0,
"frequency_penalty": 0.0,
"temperature": temperature,
"top_p": top_p,
"top_k": -1,
"repetition_penalty": repetition_penalty,
"use_beam_search": False,
"length_penalty": 1,
"early_stopping": False,
"stop_token_ids": [151329, 151336, 151338],
"ignore_eos": False,
"max_tokens": max_new_tokens,
"logprobs": None,
"prompt_logprobs": None,
"skip_special_tokens": True,
}
sampling_params = SamplingParams(**params_dict)
async for output in engine.generate(inputs=inputs, sampling_params=sampling_params, request_id="glm-4-9b"):
output_len = len(output.outputs[0].token_ids)
input_len = len(output.prompt_token_ids)
ret = {
"text": output.outputs[0].text,
"usage": {
"prompt_tokens": input_len,
"completion_tokens": output_len,
"total_tokens": output_len + input_len
},
"finish_reason": output.outputs[0].finish_reason,
}
yield ret
gc.collect()
torch.cuda.empty_cache()
def process_messages(messages, tools=None, tool_choice="none"):
_messages = messages
messages = []
msg_has_sys = False
def filter_tools(tool_choice, tools):
function_name = tool_choice.get('function', {}).get('name', None)
if not function_name:
return []
filtered_tools = [
tool for tool in tools
if tool.get('function', {}).get('name') == function_name
]
return filtered_tools
if tool_choice != "none":
if isinstance(tool_choice, dict):
tools = filter_tools(tool_choice, tools)
if tools:
messages.append(
{
"role": "system",
"content": None,
"tools": tools
}
)
msg_has_sys = True
# add to metadata
if isinstance(tool_choice, dict) and tools:
messages.append(
{
"role": "assistant",
"metadata": tool_choice["function"]["name"],
"content": ""
}
)
for m in _messages:
role, content, func_call = m.role, m.content, m.function_call
if role == "function":
messages.append(
{
"role": "observation",
"content": content
}
)
elif role == "assistant" and func_call is not None:
for response in content.split("<|assistant|>"):
if "\n" in response:
metadata, sub_content = response.split("\n", maxsplit=1)
else:
metadata, sub_content = "", response
messages.append(
{
"role": role,
"metadata": metadata,
"content": sub_content.strip()
}
)
else:
if role == "system" and msg_has_sys:
msg_has_sys = False
continue
messages.append({"role": role, "content": content})
return messages
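The `filter_tools` helper inside `process_messages` can be seen standalone: when `tool_choice` pins a specific function, every other tool is dropped before the system message is built. A self-contained copy of that logic:

```python
def filter_tools(tool_choice: dict, tools: list) -> list:
    # Keep only the tool whose function name matches tool_choice; an empty
    # or nameless tool_choice selects nothing.
    function_name = tool_choice.get("function", {}).get("name")
    if not function_name:
        return []
    return [t for t in tools if t.get("function", {}).get("name") == function_name]

tools = [
    {"type": "function", "function": {"name": "get_current_weather"}},
    {"type": "function", "function": {"name": "get_time"}},
]
print(filter_tools({"function": {"name": "get_time"}}, tools))
# → [{'type': 'function', 'function': {'name': 'get_time'}}]
```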
@app.get("/health")
async def health() -> Response:
"""Health check."""
return Response(status_code=200)
@app.get("/v1/models", response_model=ModelList)
async def list_models():
model_card = ModelCard(id="glm-4")
return ModelList(data=[model_card])
@app.post("/v1/chat/completions", response_model=ChatCompletionResponse)
async def create_chat_completion(request: ChatCompletionRequest):
if len(request.messages) < 1 or request.messages[-1].role == "assistant":
raise HTTPException(status_code=400, detail="Invalid request")
gen_params = dict(
messages=request.messages,
temperature=request.temperature,
top_p=request.top_p,
max_tokens=request.max_tokens or 1024,
echo=False,
stream=request.stream,
repetition_penalty=request.repetition_penalty,
tools=request.tools,
tool_choice=request.tool_choice,
)
logger.debug(f"==== request ====\n{gen_params}")
if request.stream:
predict_stream_generator = predict_stream(request.model, gen_params)
output = await anext(predict_stream_generator)
if output:
return EventSourceResponse(predict_stream_generator, media_type="text/event-stream")
logger.debug(f"First result output:\n{output}")
function_call = None
if output and request.tools:
try:
function_call = process_response(output, use_tool=True)
        except Exception:
            logger.warning("Failed to parse tool call")
# CallFunction
if isinstance(function_call, dict):
function_call = FunctionCallResponse(**function_call)
tool_response = ""
if not gen_params.get("messages"):
gen_params["messages"] = []
gen_params["messages"].append(ChatMessage(role="assistant", content=output))
gen_params["messages"].append(ChatMessage(role="tool", name=function_call.name, content=tool_response))
generate = predict(request.model, gen_params)
return EventSourceResponse(generate, media_type="text/event-stream")
else:
generate = parse_output_text(request.model, output)
return EventSourceResponse(generate, media_type="text/event-stream")
response = ""
async for response in generate_stream_glm4(gen_params):
pass
if response["text"].startswith("\n"):
response["text"] = response["text"][1:]
response["text"] = response["text"].strip()
usage = UsageInfo()
function_call, finish_reason = None, "stop"
if request.tools:
try:
function_call = process_response(response["text"], use_tool=True)
        except Exception:
            logger.warning(
                "Failed to parse tool call; the response may not be a function call (e.g. CogView drawing) or may already be a final answer.")
if isinstance(function_call, dict):
finish_reason = "function_call"
function_call = FunctionCallResponse(**function_call)
message = ChatMessage(
role="assistant",
content=response["text"],
function_call=function_call if isinstance(function_call, FunctionCallResponse) else None,
)
logger.debug(f"==== message ====\n{message}")
choice_data = ChatCompletionResponseChoice(
index=0,
message=message,
finish_reason=finish_reason,
)
task_usage = UsageInfo.model_validate(response["usage"])
for usage_key, usage_value in task_usage.model_dump().items():
setattr(usage, usage_key, getattr(usage, usage_key) + usage_value)
return ChatCompletionResponse(
model=request.model,
id="", # for open_source model, id is empty
choices=[choice_data],
object="chat.completion",
usage=usage
)
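The usage accumulation at the end of the non-streaming path simply sums token counters field by field. With plain dicts, the same idea (a hypothetical `add_usage` helper, not part of the server) looks like:

```python
def add_usage(total: dict, task: dict) -> dict:
    # Field-wise sum of token counters, mirroring the UsageInfo loop above
    return {key: total.get(key, 0) + value for key, value in task.items()}

usage = {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}
usage = add_usage(usage, {"prompt_tokens": 12, "completion_tokens": 30, "total_tokens": 42})
print(usage)
# → {'prompt_tokens': 12, 'completion_tokens': 30, 'total_tokens': 42}
```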
async def predict(model_id: str, params: dict):
choice_data = ChatCompletionResponseStreamChoice(
index=0,
delta=DeltaMessage(role="assistant"),
finish_reason=None
)
chunk = ChatCompletionResponse(model=model_id, id="", choices=[choice_data], object="chat.completion.chunk")
yield "{}".format(chunk.model_dump_json(exclude_unset=True))
previous_text = ""
async for new_response in generate_stream_glm4(params):
decoded_unicode = new_response["text"]
delta_text = decoded_unicode[len(previous_text):]
previous_text = decoded_unicode
finish_reason = new_response["finish_reason"]
if len(delta_text) == 0 and finish_reason != "function_call":
continue
function_call = None
if finish_reason == "function_call":
try:
function_call = process_response(decoded_unicode, use_tool=True)
            except Exception:
                logger.warning(
                    "Failed to parse tool call; the response may not be a tool call or may already be a final answer.")
if isinstance(function_call, dict):
function_call = FunctionCallResponse(**function_call)
delta = DeltaMessage(
content=delta_text,
role="assistant",
function_call=function_call if isinstance(function_call, FunctionCallResponse) else None,
)
choice_data = ChatCompletionResponseStreamChoice(
index=0,
delta=delta,
finish_reason=finish_reason
)
chunk = ChatCompletionResponse(
model=model_id,
id="",
choices=[choice_data],
object="chat.completion.chunk"
)
yield "{}".format(chunk.model_dump_json(exclude_unset=True))
choice_data = ChatCompletionResponseStreamChoice(
index=0,
delta=DeltaMessage(),
finish_reason="stop"
)
chunk = ChatCompletionResponse(
model=model_id,
id="",
choices=[choice_data],
object="chat.completion.chunk"
)
yield "{}".format(chunk.model_dump_json(exclude_unset=True))
yield '[DONE]'
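Each vLLM iteration yields the full text generated so far, so the streamed delta is just the new suffix relative to the previous snapshot. A minimal illustration of that diffing step:

```python
def deltas(snapshots):
    # Convert cumulative text snapshots into the per-chunk deltas sent via SSE
    previous = ""
    out = []
    for snapshot in snapshots:
        out.append(snapshot[len(previous):])
        previous = snapshot
    return out

print(deltas(["He", "Hello", "Hello, world"]))
# → ['He', 'llo', ', world']
```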
async def predict_stream(model_id, gen_params):
output = ""
is_function_call = False
has_send_first_chunk = False
async for new_response in generate_stream_glm4(gen_params):
decoded_unicode = new_response["text"]
delta_text = decoded_unicode[len(output):]
output = decoded_unicode
        if not is_function_call and len(output) > 7:
            # crude heuristic: this demo's tool names all start with "get_"
            is_function_call = output and 'get_' in output
if is_function_call:
continue
finish_reason = new_response["finish_reason"]
if not has_send_first_chunk:
message = DeltaMessage(
content="",
role="assistant",
function_call=None,
)
choice_data = ChatCompletionResponseStreamChoice(
index=0,
delta=message,
finish_reason=finish_reason
)
chunk = ChatCompletionResponse(
model=model_id,
id="",
choices=[choice_data],
created=int(time.time()),
object="chat.completion.chunk"
)
yield "{}".format(chunk.model_dump_json(exclude_unset=True))
send_msg = delta_text if has_send_first_chunk else output
has_send_first_chunk = True
message = DeltaMessage(
content=send_msg,
role="assistant",
function_call=None,
)
choice_data = ChatCompletionResponseStreamChoice(
index=0,
delta=message,
finish_reason=finish_reason
)
chunk = ChatCompletionResponse(
model=model_id,
id="",
choices=[choice_data],
created=int(time.time()),
object="chat.completion.chunk"
)
yield "{}".format(chunk.model_dump_json(exclude_unset=True))
if is_function_call:
yield output
else:
yield '[DONE]'
async def parse_output_text(model_id: str, value: str):
choice_data = ChatCompletionResponseStreamChoice(
index=0,
delta=DeltaMessage(role="assistant", content=value),
finish_reason=None
)
chunk = ChatCompletionResponse(model=model_id, id="", choices=[choice_data], object="chat.completion.chunk")
yield "{}".format(chunk.model_dump_json(exclude_unset=True))
choice_data = ChatCompletionResponseStreamChoice(
index=0,
delta=DeltaMessage(),
finish_reason="stop"
)
chunk = ChatCompletionResponse(model=model_id, id="", choices=[choice_data], object="chat.completion.chunk")
yield "{}".format(chunk.model_dump_json(exclude_unset=True))
yield '[DONE]'
if __name__ == "__main__":
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
engine_args = AsyncEngineArgs(
model=MODEL_PATH,
tokenizer=MODEL_PATH,
tensor_parallel_size=1,
dtype="bfloat16",
trust_remote_code=True,
gpu_memory_utilization=0.9,
enforce_eager=True,
worker_use_ray=True,
engine_use_ray=False,
disable_log_requests=True,
max_model_len=MAX_MODEL_LENGTH,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
uvicorn.run(app, host='0.0.0.0', port=8000, workers=1)
"""
Here is an example of batched requests with GLM-4-9B: build the conversation
format yourself, then call the batch function to issue the requests in one go.
Note that memory consumption in this demo is significantly higher.
"""
from typing import Optional, Union
from transformers import AutoModel, AutoTokenizer, LogitsProcessorList
MODEL_PATH = 'THUDM/glm-4-9b-chat'
tokenizer = AutoTokenizer.from_pretrained(
MODEL_PATH,
trust_remote_code=True,
encode_special_tokens=True)
model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True, device_map="auto").eval()
def process_model_outputs(inputs, outputs, tokenizer):
responses = []
for input_ids, output_ids in zip(inputs.input_ids, outputs):
response = tokenizer.decode(output_ids[len(input_ids):], skip_special_tokens=True).strip()
responses.append(response)
return responses
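`model.generate` echoes the prompt followed by the continuation for each row, so the decode step above slices off the prompt ids first. With plain id lists (hypothetical `strip_prompt` helper for illustration):

```python
def strip_prompt(input_ids: list, output_ids: list) -> list:
    # generate() returns prompt + continuation; keep only the new ids
    return output_ids[len(input_ids):]

prompt_ids = [101, 102, 103]
full_output = [101, 102, 103, 7, 8, 9]
print(strip_prompt(prompt_ids, full_output))  # → [7, 8, 9]
```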
def batch(
model,
tokenizer,
messages: Union[str, list[str]],
max_input_tokens: int = 8192,
max_new_tokens: int = 8192,
num_beams: int = 1,
do_sample: bool = True,
top_p: float = 0.8,
temperature: float = 0.8,
        logits_processor: Optional[LogitsProcessorList] = None,  # avoid a shared mutable default
):
messages = [messages] if isinstance(messages, str) else messages
batched_inputs = tokenizer(messages, return_tensors="pt", padding="max_length", truncation=True,
max_length=max_input_tokens).to(model.device)
gen_kwargs = {
"max_new_tokens": max_new_tokens,
"num_beams": num_beams,
"do_sample": do_sample,
"top_p": top_p,
"temperature": temperature,
"logits_processor": logits_processor,
"eos_token_id": model.config.eos_token_id
}
batched_outputs = model.generate(**batched_inputs, **gen_kwargs)
batched_response = process_model_outputs(batched_inputs, batched_outputs, tokenizer)
return batched_response
if __name__ == "__main__":
batch_message = [
[
            {"role": "user", "content": "Why couldn't I go to my mom and dad's wedding?"},
            {"role": "assistant", "content": "Because you had not been born yet when they got married."},
            {"role": "user", "content": "What did I just ask?"}
],
[
            {"role": "user", "content": "Hello, who are you?"}
]
]
batch_inputs = []
max_input_tokens = 1024
for i, messages in enumerate(batch_message):
new_batch_input = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
        # note: len() counts characters of the templated string here, which is
        # only a rough upper-bound proxy for the token count
        max_input_tokens = max(max_input_tokens, len(new_batch_input))
batch_inputs.append(new_batch_input)
gen_kwargs = {
"max_input_tokens": max_input_tokens,
"max_new_tokens": 8192,
"do_sample": True,
"top_p": 0.8,
"temperature": 0.8,
"num_beams": 1,
}
batch_responses = batch(model, tokenizer, batch_inputs, **gen_kwargs)
for response in batch_responses:
print("=" * 10)
print(response)
import argparse
import time
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer, BitsAndBytesConfig
import torch
from threading import Thread
MODEL_PATH = 'THUDM/glm-4-9b-chat'
def stress_test(token_len, n, num_gpu):
device = torch.device(f"cuda:{num_gpu - 1}" if torch.cuda.is_available() and num_gpu > 0 else "cpu")
tokenizer = AutoTokenizer.from_pretrained(
MODEL_PATH,
trust_remote_code=True,
padding_side="left"
)
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
trust_remote_code=True,
# quantization_config=BitsAndBytesConfig(load_in_4bit=True),
# low_cpu_mem_usage=True,
torch_dtype=torch.bfloat16
).to(device).eval()
times = []
decode_times = []
print("Warming up...")
vocab_size = tokenizer.vocab_size
warmup_token_len = 20
random_token_ids = torch.randint(3, vocab_size - 200, (warmup_token_len - 5,), dtype=torch.long)
start_tokens = [151331, 151333, 151336, 198]
end_tokens = [151337]
input_ids = torch.tensor(start_tokens + random_token_ids.tolist() + end_tokens, dtype=torch.long).unsqueeze(0).to(
device)
    attention_mask = torch.ones_like(input_ids, dtype=torch.long).to(device)
    position_ids = torch.arange(len(input_ids[0]), dtype=torch.long).unsqueeze(0).to(device)
warmup_inputs = {
'input_ids': input_ids,
'attention_mask': attention_mask,
'position_ids': position_ids
}
with torch.no_grad():
_ = model.generate(
input_ids=warmup_inputs['input_ids'],
attention_mask=warmup_inputs['attention_mask'],
max_new_tokens=2048,
do_sample=False,
repetition_penalty=1.0,
eos_token_id=[151329, 151336, 151338]
)
print("Warming up complete. Starting stress test...")
for i in range(n):
random_token_ids = torch.randint(3, vocab_size - 200, (token_len - 5,), dtype=torch.long)
input_ids = torch.tensor(start_tokens + random_token_ids.tolist() + end_tokens, dtype=torch.long).unsqueeze(
0).to(device)
        attention_mask = torch.ones_like(input_ids, dtype=torch.long).to(device)
        position_ids = torch.arange(len(input_ids[0]), dtype=torch.long).unsqueeze(0).to(device)
test_inputs = {
'input_ids': input_ids,
'attention_mask': attention_mask,
'position_ids': position_ids
}
streamer = TextIteratorStreamer(
tokenizer=tokenizer,
timeout=36000,
skip_prompt=True,
skip_special_tokens=True
)
generate_kwargs = {
"input_ids": test_inputs['input_ids'],
"attention_mask": test_inputs['attention_mask'],
"max_new_tokens": 512,
"do_sample": False,
"repetition_penalty": 1.0,
"eos_token_id": [151329, 151336, 151338],
"streamer": streamer
}
start_time = time.time()
t = Thread(target=model.generate, kwargs=generate_kwargs)
t.start()
first_token_time = None
all_token_times = []
for token in streamer:
current_time = time.time()
if first_token_time is None:
first_token_time = current_time
times.append(first_token_time - start_time)
all_token_times.append(current_time)
t.join()
end_time = time.time()
        avg_decode_speed = len(all_token_times) / (end_time - first_token_time) if all_token_times else 0
        decode_times.append(avg_decode_speed)
        print(
            f"Iteration {i + 1}/{n} - Prefilling Time: {times[-1]:.4f} seconds - Decode Speed: {avg_decode_speed:.4f} tokens/second")
torch.cuda.empty_cache()
avg_first_token_time = sum(times) / n
    avg_decode_speed = sum(decode_times) / n
    print(f"\nAverage First Token Time over {n} iterations: {avg_first_token_time:.4f} seconds")
    print(f"Average Decode Speed over {n} iterations: {avg_decode_speed:.4f} tokens/second")
    return times, avg_first_token_time, decode_times, avg_decode_speed
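The decode-speed figure reported above is the number of tokens emitted after the first token, divided by the elapsed decode time. Factored out (hypothetical `decode_speed` helper for illustration):

```python
def decode_speed(tokens_emitted: int, first_token_time: float, end_time: float) -> float:
    # tokens per second during the decode phase (after the first token arrives)
    return tokens_emitted / (end_time - first_token_time)

print(decode_speed(120, first_token_time=1.5, end_time=5.5))  # → 30.0
```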
def main():
parser = argparse.ArgumentParser(description="Stress test for model inference")
parser.add_argument('--token_len', type=int, default=1000, help='Number of tokens for each test')
parser.add_argument('--n', type=int, default=3, help='Number of iterations for the stress test')
parser.add_argument('--num_gpu', type=int, default=1, help='Number of GPUs to use for inference')
args = parser.parse_args()
token_len = args.token_len
n = args.n
num_gpu = args.num_gpu
stress_test(token_len, n, num_gpu)
if __name__ == "__main__":
main()
"""
This script creates a CLI demo with a vLLM backend for the GLM-4-9B model,
allowing users to interact with the model through a command-line interface.
Usage:
- Run the script to start the CLI demo.
- Interact with the model by typing questions and receiving responses.
Note: The script includes a modification to handle markdown to plain text conversion,
ensuring that the CLI interface displays formatted text correctly.
"""
import time
import asyncio
import argparse
from transformers import AutoTokenizer
from vllm import SamplingParams, AsyncEngineArgs, AsyncLLMEngine
from typing import List, Dict
# add model path
parser = argparse.ArgumentParser()
parser.add_argument('--model_name_or_path', default='THUDM/glm-4-9b-chat',
                    help='local path or Hugging Face repo id of the chat model')
args = parser.parse_args()
MODEL_PATH = args.model_name_or_path
def load_model_and_tokenizer(model_dir: str):
engine_args = AsyncEngineArgs(
model=model_dir,
tokenizer=model_dir,
tensor_parallel_size=1,
dtype="bfloat16",
trust_remote_code=True,
gpu_memory_utilization=0.3,
enforce_eager=True,
worker_use_ray=True,
engine_use_ray=False,
disable_log_requests=True
        # If you run into OOM, consider enabling the options below
# enable_chunked_prefill=True,
# max_num_batched_tokens=8192
)
tokenizer = AutoTokenizer.from_pretrained(
model_dir,
trust_remote_code=True,
encode_special_tokens=True
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
return engine, tokenizer
engine, tokenizer = load_model_and_tokenizer(MODEL_PATH)
async def vllm_gen(messages: List[Dict[str, str]], top_p: float, temperature: float, max_dec_len: int):
inputs = tokenizer.apply_chat_template(
messages,
add_generation_prompt=True,
tokenize=False
)
params_dict = {
"n": 1,
"best_of": 1,
"presence_penalty": 1.0,
"frequency_penalty": 0.0,
"temperature": temperature,
"top_p": top_p,
"top_k": -1,
"use_beam_search": False,
"length_penalty": 1,
"early_stopping": False,
"stop_token_ids": [151329, 151336, 151338],
"ignore_eos": False,
"max_tokens": max_dec_len,
"logprobs": None,
"prompt_logprobs": None,
"skip_special_tokens": True,
}
sampling_params = SamplingParams(**params_dict)
async for output in engine.generate(inputs=inputs, sampling_params=sampling_params, request_id=f"{time.time()}"):
yield output.outputs[0].text
async def chat():
history = []
max_length = 8192
top_p = 0.8
temperature = 0.6
print("Welcome to the GLM-4-9B CLI chat. Type your messages below.")
while True:
user_input = input("\nYou: ")
if user_input.lower() in ["exit", "quit"]:
break
history.append([user_input, ""])
messages = []
for idx, (user_msg, model_msg) in enumerate(history):
if idx == len(history) - 1 and not model_msg:
messages.append({"role": "user", "content": user_msg})
break
if user_msg:
messages.append({"role": "user", "content": user_msg})
if model_msg:
messages.append({"role": "assistant", "content": model_msg})
print("\nGLM-4: ", end="")
current_length = 0
output = ""
async for output in vllm_gen(messages, top_p, temperature, max_length):
print(output[current_length:], end="", flush=True)
current_length = len(output)
history[-1][1] = output
if __name__ == "__main__":
asyncio.run(chat())
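The history-to-messages conversion inside `chat()` can be factored out and checked on its own; the empty assistant slot on the last pair marks the turn still awaiting a reply (hypothetical `history_to_messages` helper mirroring the loop above):

```python
def history_to_messages(history):
    # history is a list of [user_msg, model_msg] pairs; an empty model_msg on
    # the final pair is the pending turn, exactly as in chat()
    messages = []
    for idx, (user_msg, model_msg) in enumerate(history):
        if idx == len(history) - 1 and not model_msg:
            messages.append({"role": "user", "content": user_msg})
            break
        if user_msg:
            messages.append({"role": "user", "content": user_msg})
        if model_msg:
            messages.append({"role": "assistant", "content": model_msg})
    return messages

print(history_to_messages([["Hi", "Hello!"], ["Tell me a joke", ""]]))
# → [{'role': 'user', 'content': 'Hi'}, {'role': 'assistant', 'content': 'Hello!'}, {'role': 'user', 'content': 'Tell me a joke'}]
```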