### Download data
```
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```
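Before benchmarking, it can help to confirm that the download is intact. The sketch below assumes the file is a JSON list whose records each carry a `conversations` list of turns (the commonly seen ShareGPT layout, which this README does not itself specify). `count_usable` is a hypothetical helper name, not part of `bench_throughput.py`.

```python
import json

def count_usable(path, min_turns=2):
    """Count records that have at least `min_turns` conversation turns.

    Assumption: the file is a JSON list of objects, each with a
    "conversations" list. Records without that key count as zero turns.
    """
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    return sum(1 for r in records if len(r.get("conversations", [])) >= min_turns)

# Example call (requires the downloaded file to be present):
# count_usable("ShareGPT_V3_unfiltered_cleaned_split.json")
```

A `json.JSONDecodeError` here usually points to a truncated download, in which case re-running `wget` is the simplest fix.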
Install [FlashInfer](https://github.com/flashinfer-ai/flashinfer) if you want to enable FlashInfer attention.

### SGLang
```
# use native attention
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --tp 1 --port 30000
# to use FlashInfer attention, append: --enable-flashinfer
# to disable RadixAttention, append: --disable-radix-cache
```
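The server takes a while to load the model, so the benchmark should only start once the port is reachable. A minimal sketch of such a wait loop, using only the standard library, is shown here. `wait_for_port` is a hypothetical helper, and an open TCP port is only a rough readiness signal: it shows the server is listening, not that it will answer generation requests.

```python
import socket
import time

def wait_for_port(host, port, timeout=120.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait and retry.
            time.sleep(interval)
    return False

# Example call for the SGLang server launched above:
# wait_for_port("127.0.0.1", 30000)
```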

```
# run ShareGPT
python3 bench_throughput.py --backend srt --tokenizer meta-llama/Llama-2-7b-chat-hf --dataset ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 10 --request-rate 10 --port 30000
```

```
# run synthetic
python3 bench_throughput.py --backend srt --tokenizer meta-llama/Llama-2-7b-chat-hf --num-prompts 1000 --request-rate 100 --input-len 1024 --output-len 256 --port 30000
```
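The synthetic run's flags fix the size of the workload, which is useful when sanity-checking the reported throughput. The arithmetic below is a sketch under two assumptions that this README does not confirm: every prompt has exactly `--input-len` tokens and every reply exactly `--output-len` tokens, and requests are issued at `--request-rate` per second on average. `synthetic_workload` is a hypothetical helper.

```python
def synthetic_workload(num_prompts, request_rate, input_len, output_len):
    """Token totals and the average time needed to issue all requests."""
    return {
        "input_tokens": num_prompts * input_len,
        "output_tokens": num_prompts * output_len,
        "total_tokens": num_prompts * (input_len + output_len),
        # Average time to send every request at the target rate, in seconds.
        "expected_send_window_s": num_prompts / request_rate,
    }

# Values taken from the synthetic command above.
w = synthetic_workload(num_prompts=1000, request_rate=100, input_len=1024, output_len=256)
print(w)
```

Under these assumptions the run covers 1,280,000 tokens (1,024,000 input and 256,000 output) and issues its requests over roughly 10 seconds, so a much longer wall-clock time suggests the server, not the request rate, is the bottleneck.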

### vLLM
```
python3 -m vllm.entrypoints.api_server --model meta-llama/Llama-2-7b-chat-hf --tensor-parallel 1 --disable-log-requests --swap-space 16 --port 21000
```

```
# run ShareGPT
python3 bench_throughput.py --backend vllm --tokenizer meta-llama/Llama-2-7b-chat-hf --dataset ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 10 --request-rate 10 --port 21000
```

```
# run synthetic
python3 bench_throughput.py --backend vllm --tokenizer meta-llama/Llama-2-7b-chat-hf --num-prompts 1000 --request-rate 100 --input-len 1024 --output-len 256 --port 21000
```

### LightLLM
```
python -m lightllm.server.api_server --model_dir ~/model_weights/Llama-2-7b-chat-hf --max_total_token_num 15600 --tokenizer_mode auto --port 22000
```

```
python3 bench_throughput.py --backend lightllm --tokenizer meta-llama/Llama-2-7b-chat-hf --dataset ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 10 --request-rate 10 --port 22000
```
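Each backend listens on its own port, so the `--backend` and `--port` values must be paired correctly. One way to keep them in sync is to generate the ShareGPT command from a single table, as in this sketch. The port numbers come from the launch commands in this README; `BACKEND_PORTS` and `sharegpt_command` are hypothetical names introduced for illustration.

```python
import shlex

# Backend name -> port, as used by the launch commands in this README.
BACKEND_PORTS = {"srt": 30000, "vllm": 21000, "lightllm": 22000}

def sharegpt_command(backend, num_prompts=10, request_rate=10):
    """Build the ShareGPT benchmark argument list for one backend."""
    return [
        "python3", "bench_throughput.py",
        "--backend", backend,
        "--tokenizer", "meta-llama/Llama-2-7b-chat-hf",
        "--dataset", "ShareGPT_V3_unfiltered_cleaned_split.json",
        "--num-prompts", str(num_prompts),
        "--request-rate", str(request_rate),
        "--port", str(BACKEND_PORTS[backend]),
    ]

# Print one ready-to-paste command per backend.
for backend in BACKEND_PORTS:
    print(shlex.join(sharegpt_command(backend)))
```

The returned list can also be passed to `subprocess.run(cmd, check=True)` once the matching server is up.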