benchmark_and_profiling.md 10.2 KB
Newer Older
Lianmin Zheng's avatar
Lianmin Zheng committed
1
2
3
# Benchmark and Profiling

## Benchmark
4

5
6
- Benchmark the latency of running a single static batch without a server. The arguments are the same as for `launch_server.py`.
  Note that this is a simplified test script without a dynamic batching server, so it may run out of memory for a batch size that a real server can handle. A real server truncates the prefill into several batches, while this simplified script does not.
7
8
9
10
11
12
13
14
  - Without a server (do not need to launch a server)
    ```bash
    python -m sglang.bench_one_batch --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch 32 --input-len 256 --output-len 32
    ```
  - With a server (please use `sglang.launch_server` to launch a server first and run the following command.)
    ```bash
    python -m sglang.bench_one_batch_server --base-url http://127.0.0.1:30000 --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch-size 32 --input-len 256 --output-len 32
    ```
15
16


17
- Benchmark offline processing. This script will start an offline engine and run the benchmark.
18
19

  ```bash
20
21
  python3 -m sglang.bench_offline_throughput --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --num-prompts 10
  ```
22

23
- Benchmark online serving. Please use `sglang.launch_server` to launch a server first and run the following command.
24
25

  ```bash
Lianmin Zheng's avatar
Lianmin Zheng committed
26
27
28
  python3 -m sglang.bench_serving --backend sglang --num-prompt 10
  ```

29
## Profile with PyTorch Profiler
30
31
32

[Pytorch Profiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html) is a convenient basic tool to inspect kernel execution time, call stack, and kernel overlap and occupancy.

Lianmin Zheng's avatar
Lianmin Zheng committed
33
### Profile a server with `sglang.bench_serving`
34

Lianmin Zheng's avatar
Lianmin Zheng committed
35
36
37
```bash
# set trace path
export SGLANG_TORCH_PROFILER_DIR=/root/sglang/profile_log
38

Lianmin Zheng's avatar
Lianmin Zheng committed
39
40
# start server
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct
41

Lianmin Zheng's avatar
Lianmin Zheng committed
42
43
44
# send profiling request from client
python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 10 --sharegpt-output-len 100 --profile
```
45

Lianmin Zheng's avatar
Lianmin Zheng committed
46
Please make sure that the `SGLANG_TORCH_PROFILER_DIR` should be set at both server and client side, otherwise the trace file cannot be generated correctly . A secure way will be setting `SGLANG_TORCH_PROFILER_DIR` in the `.*rc` file of shell (e.g. `~/.bashrc` for bash shells).
47

48
49
For more details, please refer to [Bench Serving Guide](./bench_serving.md).

Lianmin Zheng's avatar
Lianmin Zheng committed
50
51
52
### Profile a server with `sglang.bench_offline_throughput`
```bash
export SGLANG_TORCH_PROFILER_DIR=/root/sglang/profile_log
53

Lianmin Zheng's avatar
Lianmin Zheng committed
54
55
56
# profile one batch with bench_one_batch.py
# batch size can be controlled with --batch argument
python3 -m sglang.bench_one_batch --model-path meta-llama/Llama-3.1-8B-Instruct --batch 32 --input-len 1024 --output-len 10 --profile
57

Lianmin Zheng's avatar
Lianmin Zheng committed
58
59
60
# profile multiple batches with bench_offline_throughput.py
python -m sglang.bench_offline_throughput --model-path meta-llama/Llama-3.1-8B-Instruct --dataset-name random --num-prompts 10 --profile --mem-frac=0.8
```
61

Lianmin Zheng's avatar
Lianmin Zheng committed
62
### Profile a server with `sglang.profiler`
63

Lianmin Zheng's avatar
Lianmin Zheng committed
64
When the server is running (e.g., processing a decoding request), you can start live profiling immediately by sending a profile request to the server.
65

Lianmin Zheng's avatar
Lianmin Zheng committed
66
You can do this by running `python3 -m sglang.profiler`. For example:
67

Lianmin Zheng's avatar
Lianmin Zheng committed
68
69
70
```
# Terminal 1: Send a generation request
python3 -m sglang.test.send_one
71

Lianmin Zheng's avatar
Lianmin Zheng committed
72
73
74
75
# Terminal 2: Before the above request finishes, quickly launch the following command in a separate terminal.
# It will generate a profile of the above request for several decoding batches.
python3 -m sglang.profiler
```
76

77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
### Profiler Trace Merger for Distributed Traces

SGLang now supports automatic merging of profiling traces from distributed setups with multiple parallelism types (TP, DP, PP, EP). This feature is particularly useful for analyzing performance across distributed runs.

#### Multi-Node Profiling and Shared Storage Considerations

Single-node profiler output merging is completely supported. When profiling in distributed environments spanning multiple nodes, shared storage (e.g., NFS, Lustre) should be accessible by all nodes for the output directory to enable merging of trace files.

If there is no shared storage accessible across nodes, automatic merging of trace files during profiling is not supported directly as of now.

#### HTTP API Usage

```bash
# Start profiling with automatic trace merging enabled
curl -X POST <BASE_URL>/start_profile \
  -H "Content-Type: application/json" \
  -d '{
    "output_dir": "/tmp/profiles", # where to store profile traces
    "num_steps": 10,
    "activities": ["CPU", "GPU"],
    "merge_profiles": true # optional argument to merge profile traces (default=False)
  }'
```

#### Command Line Usage

```bash
# Start profiling with merge enabled
python -m sglang.profiler \
  --num-steps 10 \
  --activities CPU GPU \
  --output-dir /tmp/profiles \
  --merge-profiles # optional argument to merge profile traces (default=False)
```

#### Output Files

The profile merger generates:
- Individual rank trace files: `{profile_id}-TP-{tp}-DP-{dp}-PP-{pp}-EP-{ep}.trace.json.gz`
- Merged trace file: `merged-{profile_id}.trace.json.gz`

Lianmin Zheng's avatar
Lianmin Zheng committed
118
119
120
121
122
123
124
125
126
127
### Possible PyTorch bugs
If in any cases you encounter the following error (for example, using qwen 2.5 VL):
```bash
RuntimeError: !stack.empty() INTERNAL ASSERT FAILED at "/pytorch/torch/csrc/autograd/profiler_python.cpp":983, please report a bug to PyTorch. Python replay stack is empty.
```
This is likely a PyTorch Bug reported in [Bug: vLLM Profiler](https://github.com/vllm-project/vllm/issues/18240) and [Bug: torch.profiler.profile](https://github.com/pytorch/pytorch/issues/101632). As a workaround, you may disable `with_stack` with an environment variable such as follows:
```bash
export SGLANG_PROFILE_WITH_STACK=False
python -m sglang.bench_offline_throughput --model-path meta-llama/Llama-3.1-8B-Instruct --dataset-name random --num-prompts 10 --profile --mem-frac=0.8
```
128

Lianmin Zheng's avatar
Lianmin Zheng committed
129
130
131
132
133
134
135
136
137
138
139
140
141
142
### View traces

Trace files can be loaded and visualized from:

1. https://ui.perfetto.dev/ (any browser)
2. chrome://tracing (Chrome browser only)

If browser cannot open trace file due to its large size,
client can generate a small trace file (<100MB) by controlling number of prompts and lengths of prompt outputs.
For example, when profiling a server,

```bash
python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 2 --sharegpt-output-len 100 --profile
```
143

Lianmin Zheng's avatar
Lianmin Zheng committed
144
This command sets the number of prompts to 2 with `--num-prompts` argument and limits the length of output sequences to 100 with `--sharegpt-output-len` argument, which can generate a small trace file for browser to open smoothly.
145

Lianmin Zheng's avatar
Lianmin Zheng committed
146
Additionally, if you want to locate the SGLang Python source code through the cuda kernel in Trace, you need to disable CUDA Graph when starting the service. This can be done by using the `--disable-cuda-graph` parameter in the command to start the service.
147

Lianmin Zheng's avatar
Lianmin Zheng committed
148
## Profile with Nsight
149

150
151
152
[Nsight systems](https://docs.nvidia.com/nsight-systems/) is an advanced tool that exposes more profiling details, such as register and shared memory usage, annotated code regions and low-level CUDA APIs and events.

1. Prerequisite:
153

154
   Install using apt, or run inside a [NVIDIA Docker container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags) or [SGLang Docker container](https://github.com/sgl-project/sglang/tree/main/docker).
Lianmin Zheng's avatar
Lianmin Zheng committed
155

156
157
158
159
160
161
162
163
164
165
   ```bash
   # install nsys
   # https://docs.nvidia.com/nsight-systems/InstallationGuide/index.html
   apt update
   apt install -y --no-install-recommends gnupg
   echo "deb http://developer.download.nvidia.com/devtools/repos/ubuntu$(source /etc/lsb-release; echo "$DISTRIB_RELEASE" | tr -d .)/$(dpkg --print-architecture) /" | tee /etc/apt/sources.list.d/nvidia-devtools.list
   apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
   apt update
   apt install nsight-systems-cli
   ```
Lianmin Zheng's avatar
Lianmin Zheng committed
166

167
2. To profile a single batch, use
Lianmin Zheng's avatar
Lianmin Zheng committed
168

169
170
171
   ```bash
   nsys profile --trace-fork-before-exec=true --cuda-graph-trace=node python3 -m sglang.bench_one_batch --model meta-llama/Meta-Llama-3-8B --batch-size 64 --input-len 512
   ```
172

173
3. To profile a server, e.g.
Lianmin Zheng's avatar
Lianmin Zheng committed
174

175
176
177
   ```bash
   # launch the server, set the delay and duration times according to needs
   # after the duration time has been used up, server will be killed by nsys
Lianmin Zheng's avatar
Lianmin Zheng committed
178

179
   nsys profile --trace-fork-before-exec=true --cuda-graph-trace=node -o sglang.out --delay 60 --duration 70 python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disable-radix-cache
180

181
182
183
   # client
   python3 -m sglang.bench_serving --backend sglang --num-prompts 1000 --dataset-name random --random-input 1024 --random-output 512
   ```
184

185
   In practice, we recommend users to set `--duration` argument to a large value. Whenever user wants the server to stop profiling. Firstly run:
186

187
188
189
   ```bash
   nsys sessions list
   ```
190

191
   to get the session id in the form of `profile-XXXXX`, then run:
192

193
194
195
   ```bash
   nsys stop --session=profile-XXXXX
   ```
Lianmin Zheng's avatar
Lianmin Zheng committed
196

197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
   to manually kill the profiler and generate `nsys-rep` files instantly.

4. Use NVTX to annotate code regions, e.g. to see their execution time.

   ```bash
   # install nvtx
   pip install nvtx
   ```

   ```python
   # code snippets
   import nvtx
   with nvtx.annotate("description", color="color"):
       # some critical code
   ```
212
213

## Other tips
214

215
1. You can benchmark a model using dummy weights by only providing the config.json file. This allows for quick testing of model variants without training. To do so, add `--load-format dummy` to the above commands and then you only need a correct `config.json` under the checkpoint folder.
216
217
218
219
220
221
2. You can benchmark a model with modified configs (e.g., less layers) by using `--json-model-override-args`. For example, you can benchmark a model with only 2 layers and 2 kv heads using:

   ```bash
   python -m sglang.bench_one_batch --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch 32 --input-len 256 --output-len 32 --load-format dummy --json-model-override-args '{"num_hidden_layers": 1, "num_key_value_heads": 1}'
   ```

222
3. You can use `--python-backtrace=cuda` to see python call stack for all CUDA kernels, as in PyTorch Profiler. (Caveat: this can cause inaccurately long kernel runtimes for CUDA event based timing)
223
4. For more arguments see [Nsight Systems User Guide](https://docs.nvidia.com/nsight-systems/UserGuide/index.html).