Unverified Commit 87a0db82 authored by Lianmin Zheng, committed by GitHub

update hyperparameter guide (#1114)

When the server is running at full load, look for the following in the log:
### Tune Your Request Submission Speed
`#queue-req` indicates the number of requests in the queue. If you frequently see `#queue-req == 0`, it suggests you are bottlenecked by the request submission speed.
A healthy range for `#queue-req` is `50 - 1000`.
On the other hand, do not make `#queue-req` too large because it will also increase the scheduling overhead on the server.
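A common cause of an empty queue is a client that submits requests sequentially. Below is a minimal client-side sketch that keeps the queue populated by firing requests in parallel; it assumes the server's native `/generate` API on the default port `30000`, and the prompt and request count are arbitrary placeholders:

```bash
# Fire 256 requests concurrently so the scheduler always has queued work.
# Assumes an SGLang server at localhost:30000 exposing the native /generate API.
for i in $(seq 1 256); do
  curl -s http://localhost:30000/generate \
    -H "Content-Type: application/json" \
    -d '{"text": "Write a haiku about GPUs.", "sampling_params": {"max_new_tokens": 64}}' \
    > /dev/null &
done
wait  # block until all background requests complete
```

In a real benchmark you would use an async client, but the idea is the same: keep enough requests in flight to hold `#queue-req` in the healthy range.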
### Tune `--schedule-conservativeness`
`token usage` indicates the KV cache memory utilization of the server. `token usage > 0.9` means good utilization.
The case of serving being too conservative can happen when users send many requests with a large `max_new_tokens` but the requests stop very early due to EOS or stop strings.
On the other hand, if you see `token usage` very high and you frequently see warnings like
`decode out of memory happened, #retracted_reqs: 1, #new_token_ratio: 0.9998 -> 1.0000`, you can increase `--schedule-conservativeness` to a value like 1.3.
If you see `decode out of memory happened` occasionally but not frequently, it is okay.
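For example, a launch command that raises the conservativeness when retractions are frequent (the model path is a placeholder; `1.3` is the value suggested above):

```bash
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --schedule-conservativeness 1.3
```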
### Tune `--dp-size` and `--tp-size`
Data parallelism is better for throughput: when there is enough GPU memory, always favor data parallelism. See the sketch below for the two layouts.
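As a sketch, on a single 8-GPU node (model paths are placeholders, and the product of `--dp-size` and `--tp-size` should match the number of GPUs used):

```bash
# If one model replica fits on a single GPU, prefer pure data parallelism:
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --dp-size 8 --tp-size 1

# If a replica does not fit on one GPU, shard each replica only as much as needed:
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-70B-Instruct \
  --dp-size 2 --tp-size 4
```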
### Avoid out-of-memory by tuning `--chunked-prefill-size`, `--mem-fraction-static`, `--max-running-requests`
If you see out of memory (OOM) errors, you can decrease these parameters.
If OOM happens during prefill, try to decrease `--chunked-prefill-size` to `4096` or `2048`.
If OOM happens during decoding, try to decrease `--max-running-requests`.
You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
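A hedged example combining all three flags; the model path is a placeholder and the values are illustrative starting points, not tuned recommendations:

```bash
# Illustrative values for working around OOM:
#   --chunked-prefill-size: lower it (e.g., 4096 or 2048) if prefill OOMs
#   --max-running-requests: lower it if decoding OOMs
#   --mem-fraction-static:  lower it to shrink the KV cache pool (helps both phases)
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --chunked-prefill-size 4096 \
  --mem-fraction-static 0.8 \
  --max-running-requests 128
```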