## How to reproduce the benchmark results for SGLang v0.3.0 compared to vLLM v0.6.0

## Installation

```bash
# install sglang v0.3.0
pip install --upgrade pip
pip install "sglang[all]"==0.3.0
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/

# install vllm v0.6.0
pip install vllm==0.6.0
```

## Notes

We followed the reproduction method described in https://github.com/vllm-project/vllm/issues/8176 and added the `--num-scheduler-steps 10` flag when starting the vLLM server. vLLM's `gpu_memory_utilization` defaults to 0.9 at both TP 1 and TP 4, while SGLang's `mem_frac` defaults to 0.88 at TP 1 and 0.85 at TP 4, so we manually set SGLang's `mem_frac` to 0.88 at TP 4.

## Online benchmarks

```bash
# Llama 3.1 8B Instruct on 1 x A100
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --enable-torch-compile --disable-radix-cache
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --disable-log-requests --num-scheduler-steps 10 --max-model-len 4096

# Llama 3.1 70B Instruct on 4 x H100
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-70B-Instruct --disable-radix-cache --tp 4
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-70B-Instruct --disable-log-requests --num-scheduler-steps 10 --tensor-parallel-size 4 --max-model-len 4096

# bench serving
python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --num-prompts 1200 --request-rate 4
python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --num-prompts 2400 --request-rate 8
python3 -m sglang.bench_serving --backend vllm --dataset-name sharegpt --num-prompts 1200 --request-rate 4
python3 -m sglang.bench_serving --backend vllm --dataset-name sharegpt --num-prompts 2400 --request-rate 8
```
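Before running `bench_serving`, it is worth confirming that each server has finished loading the model. A minimal check, assuming the default ports (30000 for SGLang, 8000 for vLLM), since none of the launch commands here override them:

```bash
# each endpoint should return JSON once the server is ready to serve requests
curl http://localhost:30000/get_model_info   # SGLang
curl http://localhost:8000/v1/models         # vLLM
```

The same check applies to the offline launches below.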
## Offline benchmarks

```bash
# Llama 3.1 8B Instruct on 1 x A100
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --enable-torch-compile --disable-radix-cache
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --disable-log-requests --num-scheduler-steps 10 --max-model-len 4096

# Llama 3.1 70B Instruct on 4 x H100
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-70B-Instruct --disable-radix-cache --tp 4 --mem-frac 0.88
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-70B-Instruct --disable-log-requests --num-scheduler-steps 10 --tensor-parallel-size 4 --max-model-len 4096

# bench serving
python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --num-prompts 5000
python3 -m sglang.bench_serving --backend vllm --dataset-name sharegpt --num-prompts 5000
```

## Online benchmark results

### Llama 3.1 8B Instruct 1 x A100 80G

| RPS | Num Prompts | Engine | Median E2E Latency (ms) | Median TTFT (ms) | Median TPOT (ms) | Median ITL (ms) |
|-----|-------------|--------|-------------------------|------------------|------------------|-----------------|
| 4   | 1200        | SGLang | 1564.17                 | **31.98**        | 13.17            | **11.93**       |
| 4   | 1200        | vLLM   | 1691.97                 | **100.48**       | 14.14            | **129.32**      |
| 8   | 2400        | SGLang | 2175.02                 | **35.68**        | 17.85            | **14.41**       |
| 8   | 2400        | vLLM   | 2137.16                 | **120.39**       | 17.09            | **158.63**      |

### Llama 3.1 70B Instruct 4 x H100 80G

| RPS | Num Prompts | Engine | Median E2E Latency (ms) | Median TTFT (ms) | Median TPOT (ms) | Median ITL (ms) |
|-----|-------------|--------|-------------------------|------------------|------------------|-----------------|
| 4   | 1200        | SGLang | 3005.24                 | **53.94**        | 25.03            | **21.67**       |
| 4   | 1200        | vLLM   | 2915.60                 | **179.15**       | 23.58            | **231.23**      |
| 8   | 2400        | SGLang | 4064.98                 | **58.11**        | 33.07            | **24.45**       |
| 8   | 2400        | vLLM   | 3752.38                 | **207.12**       | 29.15            | **275.32**      |

## Offline benchmark results

### Llama 3.1 8B Instruct 1 x A100 80G

| RPS | Num Prompts | Engine | Request throughput (req/s) | Output token throughput (tok/s) |
|-----|-------------|--------|----------------------------|---------------------------------|
| inf | 5000        | SGLang | 22.03                      | **4281.51**                     |
| inf | 5000        | vLLM   | 21.27                      | **4132.37**                     |

### Llama 3.1 70B Instruct 4 x H100 80G

| RPS | Num Prompts | Engine | Request throughput (req/s) | Output token throughput (tok/s) |
|-----|-------------|--------|----------------------------|---------------------------------|
| inf | 5000        | SGLang | 19.84                      | **3856.01**                     |
| inf | 5000        | vLLM   | 19.04                      | **3700.64**                     |
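Each latency figure above is the median over all requests as reported by `bench_serving`, in milliseconds. Note that vLLM's median ITL being roughly 10x its median TPOT is consistent with `--num-scheduler-steps 10` delivering output tokens in bursts rather than one at a time. To keep the raw numbers from each run for later comparison, the results can be appended to a JSONL file; a minimal example, assuming the `--output-file` flag of `bench_serving` is available in this version:

```bash
# append one JSON record of aggregate metrics per benchmark run
python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --num-prompts 5000 --output-file results.jsonl
```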