# Guide on Hyperparameter Tuning

## Achieving Peak Throughput

Achieving a large batch size is the most important thing for attaining high throughput.

When the server is running at full load, look for the following in the log:

```
Decode batch. #running-req: 233, #token: 370959, token usage: 0.82, gen throughput (token/s): 4594.01, #queue-req: 317
```

### Tune Your Request Submission Speed
`#queue-req` indicates the number of requests in the queue. If you frequently see `#queue-req == 0`, it suggests you are bottlenecked by the request submission speed.
A healthy range for `#queue-req` is `50 - 500`.
On the other hand, do not make `#queue-req` too large because it will also increase the scheduling overhead on the server.
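
If your client submits requests sequentially, the queue drains faster than it fills. One way to keep the queue populated is to send many requests concurrently. Below is a minimal sketch, assuming the server listens on `localhost:30000` and exposes SGLang's native `/generate` endpoint; the prompt and sampling parameters are placeholders.

```bash
# Hypothetical example: submit 256 requests concurrently so the server always has work queued.
for i in $(seq 1 256); do
  curl -s http://localhost:30000/generate \
    -H "Content-Type: application/json" \
    -d '{"text": "Write a short story.", "sampling_params": {"max_new_tokens": 256}}' \
    > /dev/null &
done
wait  # Block until all background requests have finished.
```

In practice, a benchmark client with a fixed concurrency level (or a higher request rate) serves the same purpose.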

### Tune `--schedule-conservativeness`
`token usage` indicates the KV cache memory utilization of the server. `token usage > 0.9` means good utilization.
If you frequently see `token usage < 0.9` and `#queue-req > 0`, it means the server is too conservative about taking in new requests. You can decrease `--schedule-conservativeness` to a value like 0.3.
The server can become too conservative when users send many requests with a large `max_new_tokens` but the requests stop early due to EOS or stop strings.

On the other hand, if you see `token usage` very high and you frequently see warnings like
`decode out of memory happened, #retracted_reqs: 1, #new_token_ratio: 0.9998 -> 1.0000`, you can increase `--schedule-conservativeness` to a value like 1.3.
If you see `decode out of memory happened` occasionally but not frequently, it is okay.
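
For example, if you see idle KV cache capacity alongside a non-empty queue, you might relaunch the server with a lower value. This is a minimal sketch; the model path is a placeholder.

```bash
# Hypothetical example: admit new requests more aggressively than the default.
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --schedule-conservativeness 0.3
```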

### Tune `--dp-size` and `--tp-size`
Data parallelism is better for throughput. When there is enough GPU memory, always favor data parallelism; use tensor parallelism only when the model is too large to fit in a single GPU's memory.
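
For example, on a machine with 8 GPUs where the model fits on a single GPU, pure data parallelism usually gives the best throughput. This is a minimal sketch; the model path is a placeholder.

```bash
# Hypothetical example: 8 independent replicas (data parallel), no tensor parallelism.
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --dp-size 8 --tp-size 1
```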

### Avoid out-of-memory by tuning `--chunked-prefill-size`, `--mem-fraction-static`, `--max-running-requests`
If you see out of memory (OOM) errors, you can decrease these parameters.  
If OOM happens during prefill, try to decrease `--chunked-prefill-size` to `4096` or `2048`.  
If OOM happens during decoding, try to decrease `--max-running-requests`.  
You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
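
For example, a memory-constrained deployment might combine all three flags. This is a minimal sketch; the specific values and the model path are placeholders that depend on your model and GPU.

```bash
# Hypothetical example: reduce memory pressure during both prefill and decoding.
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --chunked-prefill-size 4096 \
  --max-running-requests 128 \
  --mem-fraction-static 0.8
```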

### Try advanced options
- To enable the experimental overlapped scheduler, add `--enable-overlap-scheduler`. It overlaps the CPU scheduler with GPU computation and can accelerate almost all workloads. It currently does not work with constrained decoding.
- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. It currently does not work with FP8. Both options are shown in the launch sketch below.
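
Both options are passed as launch flags. This is a minimal sketch; the model path is a placeholder, and whether each flag helps depends on your model and workload.

```bash
# Hypothetical example: enable both experimental optimizations at launch.
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --enable-overlap-scheduler --enable-torch-compile
```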

### (Minor) Tune `--schedule-policy`
If you have many shared prefixes, use the default `--schedule-policy lpm`. `lpm` stands for longest prefix match.
When you have no shared prefixes at all or you always send the requests with the shared prefixes together,
you can try `--schedule-policy fcfs`. `fcfs` stands for first-come, first-served.
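
For example, for a workload of fully independent prompts, you might launch with the following. This is a minimal sketch; the model path is a placeholder.

```bash
# Hypothetical example: first-come-first-served scheduling for workloads without shared prefixes.
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --schedule-policy fcfs
```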