"src/git@developer.sourcefind.cn:renzhc/diffusers_dcu.git" did not exist on "9b7e6f495fd7953b1716e6a967a77b46ac93fbc3"
Unverified Commit 321a963b authored by Yineng Zhang's avatar Yineng Zhang Committed by GitHub
Browse files

misc: update doc (#715)

parent e17deb27
```bash
pip install -e "python[all]"
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/
```
### Set up ulimit and HF_TOKEN
```bash
ulimit -n 65535
# Change the token to a real and usable one, with access permissions for the Llama 3 models.
export HF_TOKEN=hf_token
```
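If you want to confirm the settings took effect, a quick sanity check such as the following can help (this assumes the `huggingface-cli` tool from `huggingface_hub` is installed; it is not part of the original instructions):
```bash
# Should now report 65535.
ulimit -n
# Optional: verify the token resolves to a valid account.
huggingface-cli whoami
```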
## Benchmark
### Hardware Requirements
- 8B models: Single NVIDIA A100 80GB GPU
- 70B models: 8 x NVIDIA A100 80GB GPUs with Tensor Parallelism (TP) 8
- 70B FP8 models: 8 x NVIDIA H100 GPUs with Tensor Parallelism (TP) 8

Please ensure you have the appropriate hardware before running the benchmarks.
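To quickly confirm a machine meets these requirements, you can query the available GPUs with standard NVIDIA tooling (not part of the original write-up):
```bash
# Lists each GPU with its total memory; expect A100 80GB or H100 entries,
# and the GPU count required for the chosen model (1 for 8B, 8 for 70B).
nvidia-smi --query-gpu=index,name,memory.total --format=csv
```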
#### Offline benchmark
We tried using vLLM 0.5.3.post1, but it often crashes under high load, so we are using the older version, vLLM 0.5.2.
For TensorRT LLM preparation, refer to https://github.com/sgl-project/tensorrt-demo. Specifically, we used a batch size of 512, a max input length of 8192, and a max number of tokens of 8192. The instance count for preprocessing and postprocessing in Triton Server is 16.
```bash
# vLLM
pip install vllm==0.5.2
# meta-llama/Meta-Llama-3-8B-Instruct
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct --disable-log-requests
# meta-llama/Meta-Llama-3-70B-Instruct
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B-Instruct --disable-log-requests --tensor-parallel-size 8
# neuralmagic/Meta-Llama-3-70B-Instruct-FP8
python -m vllm.entrypoints.openai.api_server --model neuralmagic/Meta-Llama-3-70B-Instruct-FP8 --disable-log-requests --tensor-parallel-size 8
```
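Before starting a benchmark run, it is worth confirming that the server is actually serving requests. A minimal check against the OpenAI-compatible endpoint, assuming vLLM's default port 8000 (adjust if you launched on a different port):
```bash
# Should return a JSON listing that includes the served model name.
curl http://localhost:8000/v1/models
```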
```bash
wget https://raw.githubusercontent.com/sgl-project/sglang/main/python/sglang/bench_serving.py
```
```bash
# vLLM Offline
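# Note: --random-range-ratio 0.5 with --random-input 1024 samples input lengths
# roughly in [512, 1024]; the same applies to --random-output, which is how the
# [512, 1024] / [256, 512] ranges in the comments below arise.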
# Random dataset, Input [512, 1024], Output [512, 1024], num prompts 3k
python3 bench_serving.py --backend vllm --dataset-name random --num-prompts 3000 --random-input 1024 --random-output 1024 --random-range-ratio 0.5 --output-file vllm_offline_benchmark.jsonl
# Random dataset, Input [2048, 4096], Output [512, 1024], num prompts 3k
python3 bench_serving.py --backend vllm --dataset-name random --num-prompts 3000 --random-input 4096 --random-output 1024 --random-range-ratio 0.5 --output-file vllm_offline_benchmark.jsonl
# Random dataset, Input [512, 1024], Output [256, 512], num prompts 3k
python3 bench_serving.py --backend vllm --dataset-name random --num-prompts 3000 --random-input 1024 --random-output 512 --random-range-ratio 0.5 --output-file vllm_offline_benchmark.jsonl
# Random dataset, Input [2048, 4096], Output [256, 512], num prompts 3k
python3 bench_serving.py --backend vllm --dataset-name random --num-prompts 3000 --random-input 4096 --random-output 512 --random-range-ratio 0.5 --output-file vllm_offline_benchmark.jsonl
# ShareGPT dataset, num prompts 3k
python3 bench_serving.py --backend vllm --num-prompts 3000 --output-file vllm_offline_benchmark.jsonl
# get output token throughput
cat vllm_offline_benchmark.jsonl | cut -d':' -f12 | cut -d',' -f1
```
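The `cut` pipelines above depend on the position of fields within each JSON line, which is brittle if the output format of `bench_serving.py` changes. A more robust alternative is to parse the JSONL properly; the sketch below assumes the records contain `output_throughput` and `median_e2e_latency_ms` keys (field names not verified against the script, so adjust them to whatever your JSONL actually contains). The same approach works for the online benchmark files further down.
```bash
# Parse the result file as JSON instead of relying on field positions.
python3 -c '
import json, sys
for line in open(sys.argv[1]):
    rec = json.loads(line)
    for key in ("output_throughput", "median_e2e_latency_ms"):
        if key in rec:
            print(key, rec[key])
' vllm_offline_benchmark.jsonl
```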
```bash
# vLLM Online
# Random dataset, Input [512, 4096], Output [128, 1024], request rate 1, num prompts 300
python3 bench_serving.py --backend vllm --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 300 --request-rate 1 --output-file vllm_online_benchmark.jsonl
# Random dataset, Input [512, 4096], Output [128, 1024], request rate 2, num prompts 600
python3 bench_serving.py --backend vllm --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 600 --request-rate 2 --output-file vllm_online_benchmark.jsonl
# Random dataset, Input [512, 4096], Output [128, 1024], request rate 4, num prompts 1200
python3 bench_serving.py --backend vllm --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 1200 --request-rate 4 --output-file vllm_online_benchmark.jsonl
# Random dataset, Input [512, 4096], Output [128, 1024], request rate 8, num prompts 2400
python3 bench_serving.py --backend vllm --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 2400 --request-rate 8 --output-file vllm_online_benchmark.jsonl
# Random dataset, Input [512, 4096], Output [128, 1024], request rate 16, num prompts 3200
python3 bench_serving.py --backend vllm --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 3200 --request-rate 16 --output-file vllm_online_benchmark.jsonl
# get median e2e latency
cat vllm_online_benchmark.jsonl | cut -d':' -f9 | cut -d',' -f1
```
```bash
# TensorRT LLM Offline 8B
# Random dataset, Input [512, 1024], Output [512, 1024], num prompts 3k
python3 bench_serving.py --backend trt --dataset-name random --num-prompts 3000 --random-input 1024 --random-output 1024 --random-range-ratio 0.5 --output-file trt_offline_benchmark_8b.jsonl --model meta-llama/Meta-Llama-3-8B-Instruct
# Random dataset, Input [2048, 4096], Output [512, 1024], num prompts 3k
python3 bench_serving.py --backend trt --dataset-name random --num-prompts 3000 --random-input 4096 --random-output 1024 --random-range-ratio 0.5 --output-file trt_offline_benchmark_8b.jsonl --model meta-llama/Meta-Llama-3-8B-Instruct
# Random dataset, Input [512, 1024], Output [256, 512], num prompts 3k
python3 bench_serving.py --backend trt --dataset-name random --num-prompts 3000 --random-input 1024 --random-output 512 --random-range-ratio 0.5 --output-file trt_offline_benchmark_8b.jsonl --model meta-llama/Meta-Llama-3-8B-Instruct
# Random dataset, Input [2048, 4096], Output [256, 512], num prompts 3k
python3 bench_serving.py --backend trt --dataset-name random --num-prompts 3000 --random-input 4096 --random-output 512 --random-range-ratio 0.5 --output-file trt_offline_benchmark_8b.jsonl --model meta-llama/Meta-Llama-3-8B-Instruct
# ShareGPT dataset, num prompts 3k
python3 bench_serving.py --backend trt --num-prompts 3000 --output-file trt_offline_benchmark_8b.jsonl --model meta-llama/Meta-Llama-3-8B-Instruct
# get output token throughput
cat trt_offline_benchmark_8b.jsonl | cut -d':' -f12 | cut -d',' -f1
```
```bash
# TensorRT LLM Online 8B
# Random dataset, Input [512, 4096], Output [128, 1024], request rate 1, num prompts 300
python3 bench_serving.py --backend trt --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 300 --request-rate 1 --output-file trt_online_benchmark_8b.jsonl --model meta-llama/Meta-Llama-3-8B-Instruct
# Random dataset, Input [512, 4096], Output [128, 1024], request rate 2, num prompts 600
python3 bench_serving.py --backend trt --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 600 --request-rate 2 --output-file trt_online_benchmark_8b.jsonl --model meta-llama/Meta-Llama-3-8B-Instruct
# Random dataset, Input [512, 4096], Output [128, 1024], request rate 4, num prompts 1200
python3 bench_serving.py --backend trt --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 1200 --request-rate 4 --output-file trt_online_benchmark_8b.jsonl --model meta-llama/Meta-Llama-3-8B-Instruct
# Random dataset, Input [512, 4096], Output [128, 1024], request rate 8, num prompts 2400
python3 bench_serving.py --backend trt --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 2400 --request-rate 8 --output-file trt_online_benchmark_8b.jsonl --model meta-llama/Meta-Llama-3-8B-Instruct
# Random dataset, Input [512, 4096], Output [128, 1024], request rate 16, num prompts 3200
python3 bench_serving.py --backend trt --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 3200 --request-rate 16 --output-file trt_online_benchmark_8b.jsonl --model meta-llama/Meta-Llama-3-8B-Instruct
# get median e2e latency
cat trt_online_benchmark_8b.jsonl | cut -d':' -f9 | cut -d',' -f1
```
```bash
# TensorRT LLM Offline 70B
# Random dataset, Input [512, 1024], Output [512, 1024], num prompts 3k
python3 bench_serving.py --backend trt --dataset-name random --num-prompts 3000 --random-input 1024 --random-output 1024 --random-range-ratio 0.5 --output-file trt_offline_benchmark_70b.jsonl --model meta-llama/Meta-Llama-3-70B-Instruct
# Random dataset, Input [2048, 4096], Output [512, 1024], num prompts 3k
python3 bench_serving.py --backend trt --dataset-name random --num-prompts 3000 --random-input 4096 --random-output 1024 --random-range-ratio 0.5 --output-file trt_offline_benchmark_70b.jsonl --model meta-llama/Meta-Llama-3-70B-Instruct
# Random dataset, Input [512, 1024], Output [256, 512], num prompts 3k
python3 bench_serving.py --backend trt --dataset-name random --num-prompts 3000 --random-input 1024 --random-output 512 --random-range-ratio 0.5 --output-file trt_offline_benchmark_70b.jsonl --model meta-llama/Meta-Llama-3-70B-Instruct
# Random dataset, Input [2048, 4096], Output [256, 512], num prompts 3k
python3 bench_serving.py --backend trt --dataset-name random --num-prompts 3000 --random-input 4096 --random-output 512 --random-range-ratio 0.5 --output-file trt_offline_benchmark_70b.jsonl --model meta-llama/Meta-Llama-3-70B-Instruct
# ShareGPT dataset, num prompts 3k
python3 bench_serving.py --backend trt --num-prompts 3000 --output-file trt_offline_benchmark_70b.jsonl --model meta-llama/Meta-Llama-3-70B-Instruct
# get output token throughput
cat trt_offline_benchmark_70b.jsonl | cut -d':' -f12 | cut -d',' -f1
```
```bash
# TensorRT LLM Online 70B
# Random dataset, Input [512, 4096], Output [128, 1024], request rate 1, num prompts 300
python3 bench_serving.py --backend trt --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 300 --request-rate 1 --output-file trt_online_benchmark_70b.jsonl --model meta-llama/Meta-Llama-3-70B-Instruct
# Random dataset, Input [512, 4096], Output [128, 1024], request rate 2, num prompts 600
python3 bench_serving.py --backend trt --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 600 --request-rate 2 --output-file trt_online_benchmark_70b.jsonl --model meta-llama/Meta-Llama-3-70B-Instruct
# Random dataset, Input [512, 4096], Output [128, 1024], request rate 4, num prompts 1200
python3 bench_serving.py --backend trt --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 1200 --request-rate 4 --output-file trt_online_benchmark_70b.jsonl --model meta-llama/Meta-Llama-3-70B-Instruct
# Random dataset, Input [512, 4096], Output [128, 1024], request rate 8, num prompts 2400
python3 bench_serving.py --backend trt --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 2400 --request-rate 8 --output-file trt_online_benchmark_70b.jsonl --model meta-llama/Meta-Llama-3-70B-Instruct
# Random dataset, Input [512, 4096], Output [128, 1024], request rate 16, num prompts 3200
python3 bench_serving.py --backend trt --dataset-name random --random-input 4096 --random-output 1024 --random-range-ratio 0.125 --num-prompts 3200 --request-rate 16 --output-file trt_online_benchmark_70b.jsonl --model meta-llama/Meta-Llama-3-70B-Instruct
# get median e2e latency
cat trt_online_benchmark_70b.jsonl | cut -d':' -f9 | cut -d',' -f1
```
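Once all runs have finished, the per-file extractions above can be repeated in a single pass. A small loop over the result files, reusing the same field positions as the commands above (adjust the globs if you used different file names):
```bash
# Output token throughput for each offline result file.
for f in *_offline_benchmark*.jsonl; do
  echo "== $f =="
  cut -d':' -f12 "$f" | cut -d',' -f1
done

# Median end-to-end latency for each online result file.
for f in *_online_benchmark*.jsonl; do
  echo "== $f =="
  cut -d':' -f9 "$f" | cut -d',' -f1
done
```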