ModelZoo / GLM-4_pytorch / Commits / 599cfae1

Commit 599cfae1 authored Jul 26, 2024 by Rayyyyy
parent 7f9c28a1

    Delete some codes about vllm

Showing 7 changed files with 0 additions and 1249 deletions
basic_demo/README.md              +0 -137
basic_demo/README_en.md           +0 -138
basic_demo/openai_api_request.py  +0 -88
basic_demo/openai_api_server.py   +0 -549
basic_demo/trans_batch_demo.py    +0 -90
basic_demo/trans_stress_test.py   +0 -128
basic_demo/vllm_cli_demo.py       +0 -119
basic_demo/README.md deleted 100644 → 0
# Basic Demo

Read this in [English](README_en.md)

In this demo, you will experience how to use the GLM-4-9B open-source model to perform basic tasks.

Please follow the steps in this document strictly to avoid unnecessary errors.

## Device and dependency check

### Related inference test data

**All data in this document were measured on the hardware below. Actual requirements and peak GPU memory usage vary slightly with the runtime environment; treat your own environment as authoritative.**

Test hardware information:

+ OS: Ubuntu 22.04
+ Memory: 512GB
+ Python: 3.12.3
+ CUDA Version: 12.3
+ GPU Driver: 535.104.05
+ GPU: NVIDIA A100-SXM4-80GB * 8

The inference stress-test data are as follows:

**All tests were run on a single GPU, and all GPU memory figures are approximate peak values.**

#### GLM-4-9B-Chat

| Dtype | GPU Memory | Prefilling / First Response | Decode Speed | Remarks |
|------|----------|-----------------|------------------|--------------|
| BF16 | 19047MiB | 0.1554s | 27.8193 tokens/s | Input length 1000 |
| BF16 | 20629MiB | 0.8199s | 31.8613 tokens/s | Input length 8000 |
| BF16 | 27779MiB | 4.3554s | 14.4108 tokens/s | Input length 32000 |
| BF16 | 57379MiB | 38.1467s | 3.4205 tokens/s | Input length 128000 |

| Dtype | GPU Memory | Prefilling / First Response | Decode Speed | Remarks |
|------|----------|-----------------|------------------|-------------|
| Int4 | 8251MiB | 0.1667s | 23.3903 tokens/s | Input length 1000 |
| Int4 | 9613MiB | 0.8629s | 23.4248 tokens/s | Input length 8000 |
| Int4 | 16065MiB | 4.3906s | 14.6553 tokens/s | Input length 32000 |

#### GLM-4-9B-Chat-1M

| Dtype | GPU Memory | Prefilling / First Response | Decode Speed | Remarks |
|------|----------|-----------------|------------------|--------------|
| BF16 | 74497MiB | 98.4930s | 2.3653 tokens/s | Input length 200000 |

If your input exceeds 200K tokens, we recommend multi-GPU inference with the vLLM backend for better performance.

#### GLM-4V-9B

| Dtype | GPU Memory | Prefilling / First Response | Decode Speed | Remarks |
|------|----------|-----------------|------------------|------------|
| BF16 | 28131MiB | 0.1016s | 33.4660 tokens/s | Input length 1000 |
| BF16 | 33043MiB | 0.7935s | 39.2444 tokens/s | Input length 8000 |

| Dtype | GPU Memory | Prefilling / First Response | Decode Speed | Remarks |
|------|----------|-----------------|------------------|------------|
| Int4 | 10267MiB | 0.1685s | 28.7101 tokens/s | Input length 1000 |
| Int4 | 14105MiB | 0.8629s | 24.2370 tokens/s | Input length 8000 |

### Minimum hardware requirements

To run the most basic official code in this folder (transformers backend) you need:

+ Python >= 3.10
+ At least 32 GB of memory

To run all of the code in this folder you also need:

+ A Linux operating system (Debian-based distributions work best)
+ A GPU with more than 8GB of memory that supports CUDA or ROCm and `BF16` inference (A100 or newer; V100, the 20 series, and older GPU architectures are not supported)

Install the dependencies:

```shell
pip install -r requirements.txt
```
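
Before launching any of the demos, it can help to confirm that your GPU actually meets the `BF16` and memory requirements above. A minimal sketch using standard PyTorch calls (this check snippet is not part of the repository):

```python
# Quick, optional environment check for the requirements listed above.
# Illustrative only; not part of this repository.
import torch

if not torch.cuda.is_available():
    print("No CUDA device detected.")
else:
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}")
    print(f"Total VRAM: {props.total_memory / 1024 ** 3:.1f} GiB")
    print(f"BF16 supported: {torch.cuda.is_bf16_supported()}")
```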
## Basic function calls

**Unless otherwise specified, the demos in this folder do not support advanced usage such as Function Call and All Tools.**

### Using the transformers backend

+ Chat with the GLM-4-9B model from the command line:

```shell
python trans_cli_demo.py # GLM-4-9B-Chat
python trans_cli_vision_demo.py # GLM-4V-9B
```

+ Chat with the GLM-4-9B-Chat model through the Gradio web UI:

```shell
python trans_web_demo.py
```

+ Run batch inference:

```shell
python cli_batch_request_demo.py
```

### Using the vLLM backend

+ Chat with the GLM-4-9B-Chat model from the command line:

```shell
python vllm_cli_demo.py
```

+ Build your own server and talk to the GLM-4-9B-Chat model using the `OpenAI API` request format. This demo supports Function Call and All Tools.

Start the server:

```shell
python openai_api_server.py
```

Client request:

```shell
python openai_api_request.py
```

## Stress test

Users can run this script on their own devices to measure the model's generation speed on the transformers backend (see the programmatic sketch after the command below):

```shell
python trans_stress_test.py
```
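
The same measurement is also available programmatically: `trans_stress_test.py` defines a `stress_test(token_len, n, num_gpu)` function that returns the per-iteration prefilling times and decode speeds. A minimal sketch, assuming the script is run from this folder so the module is importable:

```python
# Illustrative use of the stress_test() helper defined in trans_stress_test.py.
from trans_stress_test import stress_test

# 3 iterations with 1000-token random prompts on a single GPU.
times, avg_first_token_time, decode_times, avg_decode_time = stress_test(
    token_len=1000, n=3, num_gpu=1
)
print(f"Average first token time: {avg_first_token_time:.4f} s")
print(f"Average decode speed: {avg_decode_time:.4f} tokens/s")
```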
basic_demo/README_en.md deleted 100644 → 0
# Basic Demo

In this demo, you will experience how to use the GLM-4-9B open-source model to perform basic tasks.

Please follow the steps in this document strictly to avoid unnecessary errors.

## Device and dependency check

### Related inference test data

**The data in this document were measured in the following hardware environment. Actual requirements and peak GPU memory usage differ slightly depending on the runtime environment; treat your own environment as authoritative.**

Test hardware information:

+ OS: Ubuntu 22.04
+ Memory: 512GB
+ Python: 3.12.3
+ CUDA Version: 12.3
+ GPU Driver: 535.104.05
+ GPU: NVIDIA A100-SXM4-80GB * 8

The stress-test data for inference are as follows:

**All tests were performed on a single GPU, and all GPU memory figures are approximate peak values.**

#### GLM-4-9B-Chat

| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
|-------|------------|------------|------------------|------------------------|
| BF16 | 19047MiB | 0.1554s | 27.8193 tokens/s | Input length is 1000 |
| BF16 | 20629MiB | 0.8199s | 31.8613 tokens/s | Input length is 8000 |
| BF16 | 27779MiB | 4.3554s | 14.4108 tokens/s | Input length is 32000 |
| BF16 | 57379MiB | 38.1467s | 3.4205 tokens/s | Input length is 128000 |

| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
|-------|------------|------------|------------------|-----------------------|
| Int4 | 8251MiB | 0.1667s | 23.3903 tokens/s | Input length is 1000 |
| Int4 | 9613MiB | 0.8629s | 23.4248 tokens/s | Input length is 8000 |
| Int4 | 16065MiB | 4.3906s | 14.6553 tokens/s | Input length is 32000 |

#### GLM-4-9B-Chat-1M

| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
|-------|------------|------------|-----------------|------------------------|
| BF16 | 74497MiB | 98.4930s | 2.3653 tokens/s | Input length is 200000 |

If your input exceeds 200K tokens, we recommend multi-GPU inference with the vLLM backend for better performance; a sketch of the engine configuration follows.
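
The vLLM demos in this folder (`openai_api_server.py`, `vllm_cli_demo.py`) construct an `AsyncEngineArgs` with `tensor_parallel_size=1`; for very long inputs you would point it at the long-context checkpoint and raise the parallel size. A minimal sketch, where the model id, `tensor_parallel_size=4`, and `max_model_len` are assumptions to adapt to your setup:

```python
# Sketch: multi-GPU vLLM engine for long-context inference.
# The model id, tensor_parallel_size and max_model_len below are assumptions.
from vllm import AsyncEngineArgs, AsyncLLMEngine

engine_args = AsyncEngineArgs(
    model="THUDM/glm-4-9b-chat-1m",      # assumed long-context checkpoint id
    tokenizer="THUDM/glm-4-9b-chat-1m",
    tensor_parallel_size=4,              # shard the model across 4 GPUs
    dtype="bfloat16",
    trust_remote_code=True,
    gpu_memory_utilization=0.9,
    enforce_eager=True,
    max_model_len=262144,                # must cover your longest input
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```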
#### GLM-4V-9B

| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
|-------|------------|------------|------------------|----------------------|
| BF16 | 28131MiB | 0.1016s | 33.4660 tokens/s | Input length is 1000 |
| BF16 | 33043MiB | 0.7935s | 39.2444 tokens/s | Input length is 8000 |

| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
|-------|------------|------------|------------------|----------------------|
| Int4 | 10267MiB | 0.1685s | 28.7101 tokens/s | Input length is 1000 |
| Int4 | 14105MiB | 0.8629s | 24.2370 tokens/s | Input length is 8000 |

### Minimum hardware requirements

If you want to run the most basic official code (transformers backend), you need:

+ Python >= 3.10
+ At least 32 GB of memory

If you want to run all of the official code in this folder, you also need:

+ A Linux operating system (Debian-based distributions work best)
+ A GPU with more than 8GB of memory that supports CUDA or ROCm and `BF16` inference (A100 or newer; V100, the 20 series, and older GPU architectures are not supported)

Install dependencies:

```shell
pip install -r requirements.txt
```

## Basic function calls

**Unless otherwise specified, the demos in this folder do not support advanced usage such as Function Call and All Tools.**

### Use the transformers backend code

+ Use the command line to communicate with the GLM-4-9B model:

```shell
python trans_cli_demo.py # GLM-4-9B-Chat
python trans_cli_vision_demo.py # GLM-4V-9B
```

+ Use the Gradio web client to communicate with the GLM-4-9B-Chat model:

```shell
python trans_web_demo.py
```

+ Use batch inference:

```shell
python cli_batch_request_demo.py
```

### Use the vLLM backend code

+ Use the command line to communicate with the GLM-4-9B-Chat model:

```shell
python vllm_cli_demo.py
```

+ Build the server yourself and use the `OpenAI API` request format to communicate with the glm-4-9b model. This demo supports Function Call and All Tools; a minimal client sketch follows the commands below.

Start the server:

```shell
python openai_api_server.py
```

Client request:

```shell
python openai_api_request.py
```
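
For reference, `openai_api_request.py` talks to the server through the standard OpenAI Python client. A minimal sketch of the same call (the prompt and sampling values here are illustrative):

```python
# Minimal request against the local server started above.
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://127.0.0.1:8000/v1/")
response = client.chat.completions.create(
    model="glm-4",
    messages=[{"role": "user", "content": "Tell me a short story."}],
    max_tokens=256,
    temperature=0.8,
)
print(response.choices[0].message.content)
```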
## Stress test

Users can run this script on their own devices to measure the model's generation speed on the transformers backend:

```shell
python trans_stress_test.py
```
\ No newline at end of file
basic_demo/openai_api_request.py deleted 100644 → 0
"""
This script creates a OpenAI Request demo for the glm-4-9b model, just Use OpenAI API to interact with the model.
"""
from
openai
import
OpenAI
base_url
=
"http://127.0.0.1:8000/v1/"
client
=
OpenAI
(
api_key
=
"EMPTY"
,
base_url
=
base_url
)
def
function_chat
():
messages
=
[{
"role"
:
"user"
,
"content"
:
"What's the weather like in San Francisco, Tokyo, and Paris?"
}]
tools
=
[
{
"type"
:
"function"
,
"function"
:
{
"name"
:
"get_current_weather"
,
"description"
:
"Get the current weather in a given location"
,
"parameters"
:
{
"type"
:
"object"
,
"properties"
:
{
"location"
:
{
"type"
:
"string"
,
"description"
:
"The city and state, e.g. San Francisco, CA"
,
},
"unit"
:
{
"type"
:
"string"
,
"enum"
:
[
"celsius"
,
"fahrenheit"
]},
},
"required"
:
[
"location"
],
},
},
}
]
# All Tools 能力: 绘图
# messages = [{"role": "user", "content": "帮我画一张天空的画画吧"}]
# tools = [{"type": "cogview"}]
#
# All Tools 能力: 联网查询
# messages = [{"role": "user", "content": "今天黄金的价格"}]
# tools = [{"type": "simple_browser"}]
response
=
client
.
chat
.
completions
.
create
(
model
=
"glm-4"
,
messages
=
messages
,
tools
=
tools
,
tool_choice
=
"auto"
,
# use "auto" to let the model choose the tool automatically
# tool_choice={"type": "function", "function": {"name": "my_function"}},
)
if
response
:
content
=
response
.
choices
[
0
].
message
.
content
print
(
content
)
else
:
print
(
"Error:"
,
response
.
status_code
)
def
simple_chat
(
use_stream
=
False
):
messages
=
[
{
"role"
:
"system"
,
"content"
:
"你是 GLM-4,请你热情回答用户的问题。"
,
},
{
"role"
:
"user"
,
"content"
:
"你好,请你用生动的话语给我讲一个小故事吧"
}
]
response
=
client
.
chat
.
completions
.
create
(
model
=
"glm-4"
,
messages
=
messages
,
stream
=
use_stream
,
max_tokens
=
1024
,
temperature
=
0.8
,
presence_penalty
=
1.1
,
top_p
=
0.8
)
if
response
:
if
use_stream
:
for
chunk
in
response
:
print
(
chunk
.
choices
[
0
].
delta
.
content
)
else
:
content
=
response
.
choices
[
0
].
message
.
content
print
(
content
)
else
:
print
(
"Error:"
,
response
.
status_code
)
if
__name__
==
"__main__"
:
simple_chat
()
function_chat
()
basic_demo/openai_api_server.py deleted 100644 → 0
import os
import time
from asyncio.log import logger

import uvicorn
import gc
import json
import torch

from vllm import SamplingParams, AsyncEngineArgs, AsyncLLMEngine
from fastapi import FastAPI, HTTPException, Response
from fastapi.middleware.cors import CORSMiddleware
from contextlib import asynccontextmanager
from typing import List, Literal, Optional, Union
from pydantic import BaseModel, Field
from transformers import AutoTokenizer, LogitsProcessor
from sse_starlette.sse import EventSourceResponse

EventSourceResponse.DEFAULT_PING_INTERVAL = 1000

MODEL_PATH = 'THUDM/glm-4-9b-chat'
MAX_MODEL_LENGTH = 8192


@asynccontextmanager
async def lifespan(app: FastAPI):
    yield
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.ipc_collect()


app = FastAPI(lifespan=lifespan)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)


class ModelCard(BaseModel):
    id: str
    object: str = "model"
    created: int = Field(default_factory=lambda: int(time.time()))
    owned_by: str = "owner"
    root: Optional[str] = None
    parent: Optional[str] = None
    permission: Optional[list] = None


class ModelList(BaseModel):
    object: str = "list"
    data: List[ModelCard] = []


class FunctionCallResponse(BaseModel):
    name: Optional[str] = None
    arguments: Optional[str] = None


class ChatMessage(BaseModel):
    role: Literal["user", "assistant", "system", "tool"]
    content: str = None
    name: Optional[str] = None
    function_call: Optional[FunctionCallResponse] = None


class DeltaMessage(BaseModel):
    role: Optional[Literal["user", "assistant", "system"]] = None
    content: Optional[str] = None
    function_call: Optional[FunctionCallResponse] = None


class EmbeddingRequest(BaseModel):
    input: Union[List[str], str]
    model: str


class CompletionUsage(BaseModel):
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int


class EmbeddingResponse(BaseModel):
    data: list
    model: str
    object: str
    usage: CompletionUsage


class UsageInfo(BaseModel):
    prompt_tokens: int = 0
    total_tokens: int = 0
    completion_tokens: Optional[int] = 0


class ChatCompletionRequest(BaseModel):
    model: str
    messages: List[ChatMessage]
    temperature: Optional[float] = 0.8
    top_p: Optional[float] = 0.8
    max_tokens: Optional[int] = None
    stream: Optional[bool] = False
    tools: Optional[Union[dict, List[dict]]] = None
    tool_choice: Optional[Union[str, dict]] = "None"
    repetition_penalty: Optional[float] = 1.1


class ChatCompletionResponseChoice(BaseModel):
    index: int
    message: ChatMessage
    finish_reason: Literal["stop", "length", "function_call"]


class ChatCompletionResponseStreamChoice(BaseModel):
    delta: DeltaMessage
    finish_reason: Optional[Literal["stop", "length", "function_call"]]
    index: int


class ChatCompletionResponse(BaseModel):
    model: str
    id: str
    object: Literal["chat.completion", "chat.completion.chunk"]
    choices: List[Union[ChatCompletionResponseChoice, ChatCompletionResponseStreamChoice]]
    created: Optional[int] = Field(default_factory=lambda: int(time.time()))
    usage: Optional[UsageInfo] = None


class InvalidScoreLogitsProcessor(LogitsProcessor):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        if torch.isnan(scores).any() or torch.isinf(scores).any():
            scores.zero_()
            scores[..., 5] = 5e4
        return scores


def process_response(output: str, use_tool: bool = False) -> Union[str, dict]:
    content = ""
    for response in output.split("<|assistant|>"):
        if "\n" in response:
            metadata, content = response.split("\n", maxsplit=1)
        else:
            metadata, content = "", response
        if not metadata.strip():
            content = content.strip()
        else:
            if use_tool:
                parameters = eval(content.strip())
                content = {
                    "name": metadata.strip(),
                    "arguments": json.dumps(parameters, ensure_ascii=False)
                }
            else:
                content = {
                    "name": metadata.strip(),
                    "content": content
                }
    return content


@torch.inference_mode()
async def generate_stream_glm4(params):
    messages = params["messages"]
    tools = params["tools"]
    tool_choice = params["tool_choice"]
    temperature = float(params.get("temperature", 1.0))
    repetition_penalty = float(params.get("repetition_penalty", 1.0))
    top_p = float(params.get("top_p", 1.0))
    max_new_tokens = int(params.get("max_tokens", 8192))
    messages = process_messages(messages, tools=tools, tool_choice=tool_choice)
    inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
    params_dict = {
        "n": 1,
        "best_of": 1,
        "presence_penalty": 1.0,
        "frequency_penalty": 0.0,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": -1,
        "repetition_penalty": repetition_penalty,
        "use_beam_search": False,
        "length_penalty": 1,
        "early_stopping": False,
        "stop_token_ids": [151329, 151336, 151338],
        "ignore_eos": False,
        "max_tokens": max_new_tokens,
        "logprobs": None,
        "prompt_logprobs": None,
        "skip_special_tokens": True,
    }
    sampling_params = SamplingParams(**params_dict)
    async for output in engine.generate(inputs=inputs, sampling_params=sampling_params, request_id="glm-4-9b"):
        output_len = len(output.outputs[0].token_ids)
        input_len = len(output.prompt_token_ids)
        ret = {
            "text": output.outputs[0].text,
            "usage": {
                "prompt_tokens": input_len,
                "completion_tokens": output_len,
                "total_tokens": output_len + input_len
            },
            "finish_reason": output.outputs[0].finish_reason,
        }
        yield ret
    gc.collect()
    torch.cuda.empty_cache()


def process_messages(messages, tools=None, tool_choice="none"):
    _messages = messages
    messages = []
    msg_has_sys = False

    def filter_tools(tool_choice, tools):
        function_name = tool_choice.get('function', {}).get('name', None)
        if not function_name:
            return []
        filtered_tools = [
            tool for tool in tools
            if tool.get('function', {}).get('name') == function_name
        ]
        return filtered_tools

    if tool_choice != "none":
        if isinstance(tool_choice, dict):
            tools = filter_tools(tool_choice, tools)
        if tools:
            messages.append(
                {
                    "role": "system",
                    "content": None,
                    "tools": tools
                }
            )
            msg_has_sys = True

    # add to metadata
    if isinstance(tool_choice, dict) and tools:
        messages.append(
            {
                "role": "assistant",
                "metadata": tool_choice["function"]["name"],
                "content": ""
            }
        )

    for m in _messages:
        role, content, func_call = m.role, m.content, m.function_call
        if role == "function":
            messages.append(
                {
                    "role": "observation",
                    "content": content
                }
            )
        elif role == "assistant" and func_call is not None:
            for response in content.split("<|assistant|>"):
                if "\n" in response:
                    metadata, sub_content = response.split("\n", maxsplit=1)
                else:
                    metadata, sub_content = "", response
                messages.append(
                    {
                        "role": role,
                        "metadata": metadata,
                        "content": sub_content.strip()
                    }
                )
        else:
            if role == "system" and msg_has_sys:
                msg_has_sys = False
                continue
            messages.append({"role": role, "content": content})

    return messages


@app.get("/health")
async def health() -> Response:
    """Health check."""
    return Response(status_code=200)


@app.get("/v1/models", response_model=ModelList)
async def list_models():
    model_card = ModelCard(id="glm-4")
    return ModelList(data=[model_card])


@app.post("/v1/chat/completions", response_model=ChatCompletionResponse)
async def create_chat_completion(request: ChatCompletionRequest):
    if len(request.messages) < 1 or request.messages[-1].role == "assistant":
        raise HTTPException(status_code=400, detail="Invalid request")

    gen_params = dict(
        messages=request.messages,
        temperature=request.temperature,
        top_p=request.top_p,
        max_tokens=request.max_tokens or 1024,
        echo=False,
        stream=request.stream,
        repetition_penalty=request.repetition_penalty,
        tools=request.tools,
        tool_choice=request.tool_choice,
    )
    logger.debug(f"==== request ====\n{gen_params}")

    if request.stream:
        predict_stream_generator = predict_stream(request.model, gen_params)
        output = await anext(predict_stream_generator)
        if output:
            return EventSourceResponse(predict_stream_generator, media_type="text/event-stream")
        logger.debug(f"First result output:\n{output}")

        function_call = None
        if output and request.tools:
            try:
                function_call = process_response(output, use_tool=True)
            except:
                logger.warning("Failed to parse tool call")

        # CallFunction
        if isinstance(function_call, dict):
            function_call = FunctionCallResponse(**function_call)
            tool_response = ""
            if not gen_params.get("messages"):
                gen_params["messages"] = []
            gen_params["messages"].append(ChatMessage(role="assistant", content=output))
            gen_params["messages"].append(ChatMessage(role="tool", name=function_call.name, content=tool_response))
            generate = predict(request.model, gen_params)
            return EventSourceResponse(generate, media_type="text/event-stream")
        else:
            generate = parse_output_text(request.model, output)
            return EventSourceResponse(generate, media_type="text/event-stream")

    response = ""
    async for response in generate_stream_glm4(gen_params):
        pass

    if response["text"].startswith("\n"):
        response["text"] = response["text"][1:]
    response["text"] = response["text"].strip()

    usage = UsageInfo()
    function_call, finish_reason = None, "stop"
    if request.tools:
        try:
            function_call = process_response(response["text"], use_tool=True)
        except:
            logger.warning(
                "Failed to parse tool call, maybe the response is not a function call(such as cogview drawing) or have been answered.")

    if isinstance(function_call, dict):
        finish_reason = "function_call"
        function_call = FunctionCallResponse(**function_call)

    message = ChatMessage(
        role="assistant",
        content=response["text"],
        function_call=function_call if isinstance(function_call, FunctionCallResponse) else None,
    )

    logger.debug(f"==== message ====\n{message}")

    choice_data = ChatCompletionResponseChoice(
        index=0,
        message=message,
        finish_reason=finish_reason,
    )
    task_usage = UsageInfo.model_validate(response["usage"])
    for usage_key, usage_value in task_usage.model_dump().items():
        setattr(usage, usage_key, getattr(usage, usage_key) + usage_value)

    return ChatCompletionResponse(
        model=request.model,
        id="",  # for open_source model, id is empty
        choices=[choice_data],
        object="chat.completion",
        usage=usage
    )


async def predict(model_id: str, params: dict):
    choice_data = ChatCompletionResponseStreamChoice(
        index=0,
        delta=DeltaMessage(role="assistant"),
        finish_reason=None
    )
    chunk = ChatCompletionResponse(model=model_id, id="", choices=[choice_data], object="chat.completion.chunk")
    yield "{}".format(chunk.model_dump_json(exclude_unset=True))

    previous_text = ""
    async for new_response in generate_stream_glm4(params):
        decoded_unicode = new_response["text"]
        delta_text = decoded_unicode[len(previous_text):]
        previous_text = decoded_unicode

        finish_reason = new_response["finish_reason"]
        if len(delta_text) == 0 and finish_reason != "function_call":
            continue

        function_call = None
        if finish_reason == "function_call":
            try:
                function_call = process_response(decoded_unicode, use_tool=True)
            except:
                logger.warning(
                    "Failed to parse tool call, maybe the response is not a tool call or have been answered.")

        if isinstance(function_call, dict):
            function_call = FunctionCallResponse(**function_call)

        delta = DeltaMessage(
            content=delta_text,
            role="assistant",
            function_call=function_call if isinstance(function_call, FunctionCallResponse) else None,
        )

        choice_data = ChatCompletionResponseStreamChoice(
            index=0,
            delta=delta,
            finish_reason=finish_reason
        )
        chunk = ChatCompletionResponse(model=model_id, id="", choices=[choice_data], object="chat.completion.chunk")
        yield "{}".format(chunk.model_dump_json(exclude_unset=True))

    choice_data = ChatCompletionResponseStreamChoice(
        index=0,
        delta=DeltaMessage(),
        finish_reason="stop"
    )
    chunk = ChatCompletionResponse(model=model_id, id="", choices=[choice_data], object="chat.completion.chunk")
    yield "{}".format(chunk.model_dump_json(exclude_unset=True))
    yield '[DONE]'


async def predict_stream(model_id, gen_params):
    output = ""
    is_function_call = False
    has_send_first_chunk = False
    async for new_response in generate_stream_glm4(gen_params):
        decoded_unicode = new_response["text"]
        delta_text = decoded_unicode[len(output):]
        output = decoded_unicode

        if not is_function_call and len(output) > 7:
            is_function_call = output and 'get_' in output
            if is_function_call:
                continue

            finish_reason = new_response["finish_reason"]
            if not has_send_first_chunk:
                message = DeltaMessage(
                    content="",
                    role="assistant",
                    function_call=None,
                )
                choice_data = ChatCompletionResponseStreamChoice(
                    index=0,
                    delta=message,
                    finish_reason=finish_reason
                )
                chunk = ChatCompletionResponse(
                    model=model_id,
                    id="",
                    choices=[choice_data],
                    created=int(time.time()),
                    object="chat.completion.chunk"
                )
                yield "{}".format(chunk.model_dump_json(exclude_unset=True))

            send_msg = delta_text if has_send_first_chunk else output
            has_send_first_chunk = True
            message = DeltaMessage(
                content=send_msg,
                role="assistant",
                function_call=None,
            )
            choice_data = ChatCompletionResponseStreamChoice(
                index=0,
                delta=message,
                finish_reason=finish_reason
            )
            chunk = ChatCompletionResponse(
                model=model_id,
                id="",
                choices=[choice_data],
                created=int(time.time()),
                object="chat.completion.chunk"
            )
            yield "{}".format(chunk.model_dump_json(exclude_unset=True))

    if is_function_call:
        yield output
    else:
        yield '[DONE]'


async def parse_output_text(model_id: str, value: str):
    choice_data = ChatCompletionResponseStreamChoice(
        index=0,
        delta=DeltaMessage(role="assistant", content=value),
        finish_reason=None
    )
    chunk = ChatCompletionResponse(model=model_id, id="", choices=[choice_data], object="chat.completion.chunk")
    yield "{}".format(chunk.model_dump_json(exclude_unset=True))

    choice_data = ChatCompletionResponseStreamChoice(
        index=0,
        delta=DeltaMessage(),
        finish_reason="stop"
    )
    chunk = ChatCompletionResponse(model=model_id, id="", choices=[choice_data], object="chat.completion.chunk")
    yield "{}".format(chunk.model_dump_json(exclude_unset=True))
    yield '[DONE]'


if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
    engine_args = AsyncEngineArgs(
        model=MODEL_PATH,
        tokenizer=MODEL_PATH,
        tensor_parallel_size=1,
        dtype="bfloat16",
        trust_remote_code=True,
        gpu_memory_utilization=0.9,
        enforce_eager=True,
        worker_use_ray=True,
        engine_use_ray=False,
        disable_log_requests=True,
        max_model_len=MAX_MODEL_LENGTH,
    )
    engine = AsyncLLMEngine.from_engine_args(engine_args)
    uvicorn.run(app, host='0.0.0.0', port=8000, workers=1)
basic_demo/trans_batch_demo.py deleted 100644 → 0
"""
Here is an example of using batch request glm-4-9b,
here you need to build the conversation format yourself and then call the batch function to make batch requests.
Please note that in this demo, the memory consumption is significantly higher.
"""
from
typing
import
Optional
,
Union
from
transformers
import
AutoModel
,
AutoTokenizer
,
LogitsProcessorList
MODEL_PATH
=
'THUDM/glm-4-9b-chat'
tokenizer
=
AutoTokenizer
.
from_pretrained
(
MODEL_PATH
,
trust_remote_code
=
True
,
encode_special_tokens
=
True
)
model
=
AutoModel
.
from_pretrained
(
MODEL_PATH
,
trust_remote_code
=
True
,
device_map
=
"auto"
).
eval
()
def
process_model_outputs
(
inputs
,
outputs
,
tokenizer
):
responses
=
[]
for
input_ids
,
output_ids
in
zip
(
inputs
.
input_ids
,
outputs
):
response
=
tokenizer
.
decode
(
output_ids
[
len
(
input_ids
):],
skip_special_tokens
=
True
).
strip
()
responses
.
append
(
response
)
return
responses
def
batch
(
model
,
tokenizer
,
messages
:
Union
[
str
,
list
[
str
]],
max_input_tokens
:
int
=
8192
,
max_new_tokens
:
int
=
8192
,
num_beams
:
int
=
1
,
do_sample
:
bool
=
True
,
top_p
:
float
=
0.8
,
temperature
:
float
=
0.8
,
logits_processor
:
Optional
[
LogitsProcessorList
]
=
LogitsProcessorList
(),
):
messages
=
[
messages
]
if
isinstance
(
messages
,
str
)
else
messages
batched_inputs
=
tokenizer
(
messages
,
return_tensors
=
"pt"
,
padding
=
"max_length"
,
truncation
=
True
,
max_length
=
max_input_tokens
).
to
(
model
.
device
)
gen_kwargs
=
{
"max_new_tokens"
:
max_new_tokens
,
"num_beams"
:
num_beams
,
"do_sample"
:
do_sample
,
"top_p"
:
top_p
,
"temperature"
:
temperature
,
"logits_processor"
:
logits_processor
,
"eos_token_id"
:
model
.
config
.
eos_token_id
}
batched_outputs
=
model
.
generate
(
**
batched_inputs
,
**
gen_kwargs
)
batched_response
=
process_model_outputs
(
batched_inputs
,
batched_outputs
,
tokenizer
)
return
batched_response
if
__name__
==
"__main__"
:
batch_message
=
[
[
{
"role"
:
"user"
,
"content"
:
"我的爸爸和妈妈结婚为什么不能带我去"
},
{
"role"
:
"assistant"
,
"content"
:
"因为他们结婚时你还没有出生"
},
{
"role"
:
"user"
,
"content"
:
"我刚才的提问是"
}
],
[
{
"role"
:
"user"
,
"content"
:
"你好,你是谁"
}
]
]
batch_inputs
=
[]
max_input_tokens
=
1024
for
i
,
messages
in
enumerate
(
batch_message
):
new_batch_input
=
tokenizer
.
apply_chat_template
(
messages
,
add_generation_prompt
=
True
,
tokenize
=
False
)
max_input_tokens
=
max
(
max_input_tokens
,
len
(
new_batch_input
))
batch_inputs
.
append
(
new_batch_input
)
gen_kwargs
=
{
"max_input_tokens"
:
max_input_tokens
,
"max_new_tokens"
:
8192
,
"do_sample"
:
True
,
"top_p"
:
0.8
,
"temperature"
:
0.8
,
"num_beams"
:
1
,
}
batch_responses
=
batch
(
model
,
tokenizer
,
batch_inputs
,
**
gen_kwargs
)
for
response
in
batch_responses
:
print
(
"="
*
10
)
print
(
response
)
basic_demo/trans_stress_test.py deleted 100644 → 0
import argparse
import time
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer, BitsAndBytesConfig
import torch
from threading import Thread

MODEL_PATH = 'THUDM/glm-4-9b-chat'


def stress_test(token_len, n, num_gpu):
    device = torch.device(f"cuda:{num_gpu - 1}" if torch.cuda.is_available() and num_gpu > 0 else "cpu")
    tokenizer = AutoTokenizer.from_pretrained(
        MODEL_PATH,
        trust_remote_code=True,
        padding_side="left"
    )
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        trust_remote_code=True,
        # quantization_config=BitsAndBytesConfig(load_in_4bit=True),
        # low_cpu_mem_usage=True,
        torch_dtype=torch.bfloat16
    ).to(device).eval()
    times = []
    decode_times = []

    print("Warming up...")
    vocab_size = tokenizer.vocab_size
    warmup_token_len = 20
    random_token_ids = torch.randint(3, vocab_size - 200, (warmup_token_len - 5,), dtype=torch.long)
    start_tokens = [151331, 151333, 151336, 198]
    end_tokens = [151337]
    input_ids = torch.tensor(start_tokens + random_token_ids.tolist() + end_tokens, dtype=torch.long).unsqueeze(0).to(device)
    attention_mask = torch.ones_like(input_ids, dtype=torch.bfloat16).to(device)
    position_ids = torch.arange(len(input_ids[0]), dtype=torch.bfloat16).unsqueeze(0).to(device)
    warmup_inputs = {
        'input_ids': input_ids,
        'attention_mask': attention_mask,
        'position_ids': position_ids
    }
    with torch.no_grad():
        _ = model.generate(
            input_ids=warmup_inputs['input_ids'],
            attention_mask=warmup_inputs['attention_mask'],
            max_new_tokens=2048,
            do_sample=False,
            repetition_penalty=1.0,
            eos_token_id=[151329, 151336, 151338]
        )
    print("Warming up complete. Starting stress test...")

    for i in range(n):
        random_token_ids = torch.randint(3, vocab_size - 200, (token_len - 5,), dtype=torch.long)
        input_ids = torch.tensor(start_tokens + random_token_ids.tolist() + end_tokens, dtype=torch.long).unsqueeze(0).to(device)
        attention_mask = torch.ones_like(input_ids, dtype=torch.bfloat16).to(device)
        position_ids = torch.arange(len(input_ids[0]), dtype=torch.bfloat16).unsqueeze(0).to(device)
        test_inputs = {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'position_ids': position_ids
        }

        streamer = TextIteratorStreamer(
            tokenizer=tokenizer,
            timeout=36000,
            skip_prompt=True,
            skip_special_tokens=True
        )

        generate_kwargs = {
            "input_ids": test_inputs['input_ids'],
            "attention_mask": test_inputs['attention_mask'],
            "max_new_tokens": 512,
            "do_sample": False,
            "repetition_penalty": 1.0,
            "eos_token_id": [151329, 151336, 151338],
            "streamer": streamer
        }

        start_time = time.time()
        t = Thread(target=model.generate, kwargs=generate_kwargs)
        t.start()

        first_token_time = None
        all_token_times = []

        for token in streamer:
            current_time = time.time()
            if first_token_time is None:
                first_token_time = current_time
                times.append(first_token_time - start_time)
            all_token_times.append(current_time)

        t.join()
        end_time = time.time()

        avg_decode_time_per_token = len(all_token_times) / (end_time - first_token_time) if all_token_times else 0
        decode_times.append(avg_decode_time_per_token)
        print(f"Iteration {i + 1}/{n} - Prefilling Time: {times[-1]:.4f} seconds - Average Decode Time: {avg_decode_time_per_token:.4f} tokens/second")

        torch.cuda.empty_cache()

    avg_first_token_time = sum(times) / n
    avg_decode_time = sum(decode_times) / n
    print(f"\nAverage First Token Time over {n} iterations: {avg_first_token_time:.4f} seconds")
    print(f"Average Decode Time per Token over {n} iterations: {avg_decode_time:.4f} tokens/second")
    return times, avg_first_token_time, decode_times, avg_decode_time


def main():
    parser = argparse.ArgumentParser(description="Stress test for model inference")
    parser.add_argument('--token_len', type=int, default=1000, help='Number of tokens for each test')
    parser.add_argument('--n', type=int, default=3, help='Number of iterations for the stress test')
    parser.add_argument('--num_gpu', type=int, default=1, help='Number of GPUs to use for inference')
    args = parser.parse_args()

    token_len = args.token_len
    n = args.n
    num_gpu = args.num_gpu

    stress_test(token_len, n, num_gpu)


if __name__ == "__main__":
    main()
basic_demo/vllm_cli_demo.py deleted 100644 → 0
"""
This script creates a CLI demo with vllm backand for the glm-4-9b model,
allowing users to interact with the model through a command-line interface.
Usage:
- Run the script to start the CLI demo.
- Interact with the model by typing questions and receiving responses.
Note: The script includes a modification to handle markdown to plain text conversion,
ensuring that the CLI interface displays formatted text correctly.
"""
import
time
import
asyncio
import
argparse
from
transformers
import
AutoTokenizer
from
vllm
import
SamplingParams
,
AsyncEngineArgs
,
AsyncLLMEngine
from
typing
import
List
,
Dict
# add model path
parser
=
argparse
.
ArgumentParser
()
parser
.
add_argument
(
'--model_name_or_path'
,
default
=
'THUDM/glm-4-9b'
)
args
=
parser
.
parse_args
()
# MODEL_PATH = 'THUDM/glm-4-9b'
MODEL_PATH
=
args
.
model_name_or_path
def
load_model_and_tokenizer
(
model_dir
:
str
):
engine_args
=
AsyncEngineArgs
(
model
=
model_dir
,
tokenizer
=
model_dir
,
tensor_parallel_size
=
1
,
dtype
=
"bfloat16"
,
trust_remote_code
=
True
,
gpu_memory_utilization
=
0.3
,
enforce_eager
=
True
,
worker_use_ray
=
True
,
engine_use_ray
=
False
,
disable_log_requests
=
True
# 如果遇见 OOM 现象,建议开启下述参数
# enable_chunked_prefill=True,
# max_num_batched_tokens=8192
)
tokenizer
=
AutoTokenizer
.
from_pretrained
(
model_dir
,
trust_remote_code
=
True
,
encode_special_tokens
=
True
)
engine
=
AsyncLLMEngine
.
from_engine_args
(
engine_args
)
return
engine
,
tokenizer
engine
,
tokenizer
=
load_model_and_tokenizer
(
MODEL_PATH
)
async
def
vllm_gen
(
messages
:
List
[
Dict
[
str
,
str
]],
top_p
:
float
,
temperature
:
float
,
max_dec_len
:
int
):
inputs
=
tokenizer
.
apply_chat_template
(
messages
,
add_generation_prompt
=
True
,
tokenize
=
False
)
params_dict
=
{
"n"
:
1
,
"best_of"
:
1
,
"presence_penalty"
:
1.0
,
"frequency_penalty"
:
0.0
,
"temperature"
:
temperature
,
"top_p"
:
top_p
,
"top_k"
:
-
1
,
"use_beam_search"
:
False
,
"length_penalty"
:
1
,
"early_stopping"
:
False
,
"stop_token_ids"
:
[
151329
,
151336
,
151338
],
"ignore_eos"
:
False
,
"max_tokens"
:
max_dec_len
,
"logprobs"
:
None
,
"prompt_logprobs"
:
None
,
"skip_special_tokens"
:
True
,
}
sampling_params
=
SamplingParams
(
**
params_dict
)
async
for
output
in
engine
.
generate
(
inputs
=
inputs
,
sampling_params
=
sampling_params
,
request_id
=
f
"
{
time
.
time
()
}
"
):
yield
output
.
outputs
[
0
].
text
async
def
chat
():
history
=
[]
max_length
=
8192
top_p
=
0.8
temperature
=
0.6
print
(
"Welcome to the GLM-4-9B CLI chat. Type your messages below."
)
while
True
:
user_input
=
input
(
"
\n
You: "
)
if
user_input
.
lower
()
in
[
"exit"
,
"quit"
]:
break
history
.
append
([
user_input
,
""
])
messages
=
[]
for
idx
,
(
user_msg
,
model_msg
)
in
enumerate
(
history
):
if
idx
==
len
(
history
)
-
1
and
not
model_msg
:
messages
.
append
({
"role"
:
"user"
,
"content"
:
user_msg
})
break
if
user_msg
:
messages
.
append
({
"role"
:
"user"
,
"content"
:
user_msg
})
if
model_msg
:
messages
.
append
({
"role"
:
"assistant"
,
"content"
:
model_msg
})
print
(
"
\n
GLM-4: "
,
end
=
""
)
current_length
=
0
output
=
""
async
for
output
in
vllm_gen
(
messages
,
top_p
,
temperature
,
max_length
):
print
(
output
[
current_length
:],
end
=
""
,
flush
=
True
)
current_length
=
len
(
output
)
history
[
-
1
][
1
]
=
output
if
__name__
==
"__main__"
:
asyncio
.
run
(
chat
())