# Benchmark Latency and Throughput

## SGLang

### Launch a server
```
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000
```
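
Once the server is up, you can send a quick request to verify it responds. A minimal sanity check, assuming the native server's default `/generate` endpoint and the port used above:

```
curl http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Once upon a time,", "sampling_params": {"max_new_tokens": 16, "temperature": 0}}'
```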

### Benchmark one batch

```
python3 bench_one.py
python3 bench_one.py --batch-size 64
```
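
`bench_one.py` also accepts a list of batch sizes (as in the vLLM example later in this README), so you can sweep several sizes in one run; a sketch:

```
python3 bench_one.py --batch-size 1 8 32 64 --input-len 1024
```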

### Benchmark online serving with many requests

```
python3 bench_serving.py --backend srt --port 30000 --tokenizer meta-llama/Llama-2-7b-chat-hf --num-prompts 1000 --request-rate 100 --input-len 1024 --output-len 256
```
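
Throughput results are sensitive to the request rate, so it is often worth sweeping it. A small sketch that reuses the flags from the command above:

```
# sweep several request rates with otherwise identical settings
for rate in 1 10 100; do
  python3 bench_serving.py --backend srt --port 30000 \
    --tokenizer meta-llama/Llama-2-7b-chat-hf \
    --num-prompts 1000 --request-rate $rate \
    --input-len 1024 --output-len 256
done
```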

### Benchmark online serving on the ShareGPT dataset

#### Download data
```
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```

#### Run ShareGPT
```
python3 bench_serving.py --backend srt --port 30000 --tokenizer meta-llama/Llama-2-7b-chat-hf --dataset ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 10 --request-rate 10
```

### Profile with Nsight
0. Prerequisite
```bash
# install nsys
# https://docs.nvidia.com/nsight-systems/InstallationGuide/index.html
apt update
apt install -y --no-install-recommends gnupg
echo "deb http://developer.download.nvidia.com/devtools/repos/ubuntu$(source /etc/lsb-release; echo "$DISTRIB_RELEASE" | tr -d .)/$(dpkg --print-architecture) /" | tee /etc/apt/sources.list.d/nvidia-devtools.list
apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
apt update
apt install nsight-systems-cli
```
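
You can confirm the installation before profiling:

```bash
# print the installed Nsight Systems version
nsys --version
```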

1. To profile a single batch, use `nsys profile --cuda-graph-trace=node python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 64 --input-len 512`

2. To profile a server, e.g.:

```bash
# server
# set the delay and duration times according to needs
nsys profile --trace-fork-before-exec=true --cuda-graph-trace=node -o sglang.out --delay 60 --duration 70 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --disable-radix-cache

# client
python3 -m sglang.bench_serving --backend sglang --num-prompts 6000 --dataset-name random --random-input 4096 --random-output 2048
```
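
The server command above writes the trace to `sglang.out.nsys-rep` (nsys appends the suffix). You can open it in the Nsight Systems GUI, or summarize it on the command line:

```bash
# print summary statistics from the generated report
nsys stats sglang.out.nsys-rep
```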

3. Use NVTX, e.g.:

```bash
# install nvtx
pip install nvtx
```

```python
# mark a critical region so it shows up as a named range in the Nsight timeline
import nvtx

with nvtx.annotate("description", color="blue"):
    pass  # some critical code
```


## Other baselines

### vLLM
```
python3 -m vllm.entrypoints.api_server --model meta-llama/Llama-2-7b-chat-hf --tensor-parallel-size 1 --disable-log-requests --swap-space 16 --port 21000
```

```
# run synthetic
python3 bench_serving.py --backend vllm --port 21000 --tokenizer meta-llama/Llama-2-7b-chat-hf --num-prompts 1000 --request-rate 100 --input-len 1024 --output-len 256
```

```
# run ShareGPT
python3 bench_serving.py --backend vllm --port 21000 --tokenizer meta-llama/Llama-2-7b-chat-hf --dataset ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 10 --request-rate 10
```

```
# run one batch
python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B --tensor-parallel-size 8 --disable-log-requests --max-num-seqs 1024 --quantization fp8

python3 bench_one.py --input-len 1024 --batch-size 1 1 2 4 8 16 32 64 128 256 512 768 1024 --port 8000 --backend vllm
```

### LightLLM
```
python -m lightllm.server.api_server --model_dir ~/model_weights/Llama-2-7b-chat-hf --max_total_token_num 15600 --tokenizer_mode auto --port 22000
```

```
python3 bench_serving.py --backend lightllm --port 22000 --tokenizer meta-llama/Llama-2-7b-chat-hf --dataset ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 10 --request-rate 10
```