sglang / Commits / 945aa9be

Commit 945aa9be, authored Jun 27, 2024 by Lianmin Zheng; committed by GitHub on Jun 27, 2024

Update readme (#568)

parent 2e6e62e1
Showing 3 changed files with 40 additions and 25 deletions.
- benchmark/latency_throughput/README.md (+29, -18)
- benchmark/latency_throughput/bench_one.py (+1, -1)
- benchmark/latency_throughput/bench_serving.py (+10, -6)
benchmark/latency_throughput/README.md
-### Download data
+# Benchmark Latency and Throughput
+## SGLang
+### Launch server
 ```
-wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
+python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000
 ```
-### SGLang
+Install [FlashInfer](https://github.com/flashinfer-ai/flashinfer) if you want it to be enabled.
+### Benchmark one batch
 ```
-# use native attention
-python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --tp 1 --port 30000
-# use flashinfer attention: --enable-flashinfer
-# disable RadixAttention: --disable-radix-cache
+python3 bench_one.py
+python3 bench_one.py --batch-size 64
 ```
+### Benchmark online serving with many requests
 ```
-# run ShareGPT
-python3 bench_throughput.py --backend srt --tokenizer meta-llama/Llama-2-7b-chat-hf --dataset ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 10 --request-rate 10 --port 30000
+python3 bench_serving.py --backend srt --port 30000 --tokenizer meta-llama/Llama-2-7b-chat-hf --num-prompt 1000 --request-rate 100 --input-len 1024 --output-len 256
 ```
+### Benchmark online serving on the ShareGPT dataset
+#### Download data
 ```
-# run synthetic
-python3 bench_throughput.py --backend srt --tokenizer meta-llama/Llama-2-7b-chat-hf --num-prompt 1000 --request-rate 100 --input-len 1024 --output-len 256 --port 30000
+wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
 ```
+#### Run ShareGPT
+```
+python3 bench_throughput.py --backend srt --port 30000 --tokenizer meta-llama/Llama-2-7b-chat-hf --dataset ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 10 --request-rate 10
+```
## Other baselines

### vLLM
...
@@ -30,13 +41,13 @@ python3 -m vllm.entrypoints.api_server --model meta-llama/Llama-2-7b-chat-hf --t
 ```
 ```
-# run ShareGPT
-python3 bench_throughput.py --backend vllm --tokenizer meta-llama/Llama-2-7b-chat-hf --dataset ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 10 --request-rate 10 --port 21000
+# run synthetic
+python3 bench_throughput.py --backend vllm --port 30000 --tokenizer meta-llama/Llama-2-7b-chat-hf --num-prompt 1000 --request-rate 100 --input-len 1024 --output-len 256
 ```
 ```
-# run synthetic
-python3 bench_throughput.py --backend vllm --tokenizer meta-llama/Llama-2-7b-chat-hf --num-prompt 1000 --request-rate 100 --input-len 1024 --output-len 256 --port 30000
+# run ShareGPT
+python3 bench_throughput.py --backend vllm --port 21000 --tokenizer meta-llama/Llama-2-7b-chat-hf --dataset ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 10 --request-rate 10
 ```
...
@@ -46,5 +57,5 @@ python -m lightllm.server.api_server --model_dir ~/model_weights/Llama-2-7b-chat
 ```
 ```
-python3 bench_throughput.py --backend lightllm --tokenizer meta-llama/Llama-2-7b-chat-hf --dataset ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 10 --request-rate 10 --port 22000
-```
+python3 bench_throughput.py --backend lightllm --port 22000 --tokenizer meta-llama/Llama-2-7b-chat-hf --dataset ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 10 --request-rate 10
+```
\ No newline at end of file
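
Note on trying the reorganized README: every command above assumes a server already listening on port 30000. A quick smoke test before benchmarking could look like the sketch below; the /generate endpoint and the payload shape match SGLang's HTTP API as of mid-2024, but treat both as assumptions to verify against your installed version.

```
# Hypothetical smoke test for the server started by sglang.launch_server.
# The /generate endpoint and payload shape are assumptions based on
# SGLang's mid-2024 HTTP API; verify against your installed version.
import requests

response = requests.post(
    "http://127.0.0.1:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0, "max_new_tokens": 16},
    },
)
# Prints the generated text and metadata as JSON.
print(response.json())
```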
benchmark/latency_throughput/test_latency.py → benchmark/latency_throughput/bench_one.py
...
@@ -92,4 +92,4 @@ if __name__ == "__main__":
     print(ret)
 
     speed = args.batch_size * max_new_tokens / latency
-    print(f"latency: {latency:.2f} s, speed: {speed:.2f} token/s")
+    print(f"latency: {latency:.2f} s, speed: {speed:.2f} token/s")
\ No newline at end of file
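
The substantive line in this hunk is the speed report, where speed = args.batch_size * max_new_tokens / latency: every request in the batch decodes max_new_tokens tokens, so total tokens divided by wall time gives aggregate decode speed. A worked example of that formula with illustrative numbers (not taken from any real run):

```
# Illustrative numbers only, not from a real benchmark run.
batch_size = 64        # concurrent requests in the batch
max_new_tokens = 256   # tokens generated per request
latency = 8.5          # measured wall-clock time in seconds

# The batch produces batch_size * max_new_tokens tokens in `latency` seconds.
speed = batch_size * max_new_tokens / latency
print(f"latency: {latency:.2f} s, speed: {speed:.2f} token/s")
# latency: 8.50 s, speed: 1927.53 token/s
```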
benchmark/latency_throughput/bench_throughput.py → benchmark/latency_throughput/bench_serving.py
...
@@ -296,23 +296,27 @@ def main(args: argparse.Namespace):
     )
     benchmark_end_time = time.perf_counter()
     benchmark_time = benchmark_end_time - benchmark_start_time
-    print(f"Total time: {benchmark_time:.2f} s")
-    print(f"Throughput: {args.num_prompts / benchmark_time:.2f} requests/s")
 
-    # Compute the latency statistics.
+    # Compute the statistics.
     avg_latency = np.mean([latency for _, _, latency in REQUEST_LATENCY])
-    print(f"Average latency: {avg_latency:.2f} s")
     avg_per_token_latency = np.mean(
         [
             latency / (prompt_len + output_len)
             for prompt_len, output_len, latency in REQUEST_LATENCY
         ]
     )
-    print(f"Average latency per token: {avg_per_token_latency:.2f} s")
     avg_per_output_token_latency = np.mean(
         [latency / output_len for _, output_len, latency in REQUEST_LATENCY]
     )
-    print(
-        "Average latency per output token: "
-        f"{avg_per_output_token_latency:.2f} s"
-    )
+    decoding_throughput = np.sum([output_len for _, output_len, _ in REQUEST_LATENCY]) / benchmark_time
+
+    print(f"Total time: {benchmark_time:.2f} s")
+    print(f"Request throughput: {args.num_prompts / benchmark_time:.2f} requests/s")
+    print(f"Decoding throughput: {decoding_throughput:.2f} token/s")
+    print(f"Average latency: {avg_latency:.2f} s")
+    print(f"Average latency per token: {avg_per_token_latency:.2f} s")
+    print(f"Average latency per output token: {avg_per_output_token_latency:.2f} s")
 
 
 if __name__ == "__main__":
...
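
The metrics this hunk introduces can be reproduced standalone: REQUEST_LATENCY holds (prompt_len, output_len, latency) tuples, and each statistic is a reduction over that list. A minimal sketch with made-up data follows; the tuple values and benchmark_time are placeholders, not measurements, and the real script divides args.num_prompts (rather than len(REQUEST_LATENCY)) by benchmark_time for request throughput.

```
import numpy as np

# Placeholder (prompt_len, output_len, latency-in-seconds) tuples standing in
# for the REQUEST_LATENCY list that bench_serving.py fills during a run.
REQUEST_LATENCY = [(512, 128, 2.1), (1024, 256, 4.0), (256, 64, 1.2)]
benchmark_time = 5.0  # placeholder wall-clock duration of the whole run

avg_latency = np.mean([latency for _, _, latency in REQUEST_LATENCY])
avg_per_token_latency = np.mean(
    [latency / (prompt_len + output_len) for prompt_len, output_len, latency in REQUEST_LATENCY]
)
avg_per_output_token_latency = np.mean(
    [latency / output_len for _, output_len, latency in REQUEST_LATENCY]
)
# Decoding throughput counts only generated tokens, not prompt tokens.
decoding_throughput = np.sum([output_len for _, output_len, _ in REQUEST_LATENCY]) / benchmark_time

print(f"Request throughput: {len(REQUEST_LATENCY) / benchmark_time:.2f} requests/s")
print(f"Decoding throughput: {decoding_throughput:.2f} token/s")
print(f"Average latency: {avg_latency:.2f} s")
print(f"Average latency per token: {avg_per_token_latency:.2f} s")
print(f"Average latency per output token: {avg_per_output_token_latency:.2f} s")
```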