Unverified Commit 8c07fabd authored by Lianmin Zheng's avatar Lianmin Zheng Committed by GitHub
Browse files

Update hyperparameter_tuning.md (#9083)

parent 90f44b74
...@@ -25,16 +25,31 @@ On the other hand, if you see `token usage` very high and you frequently see war ...@@ -25,16 +25,31 @@ On the other hand, if you see `token usage` very high and you frequently see war
`KV cache pool is full. Retract requests. #retracted_reqs: 1, #new_token_ratio: 0.9998 -> 1.0000`, you can increase `--schedule-conservativeness` to a value like 1.3. `KV cache pool is full. Retract requests. #retracted_reqs: 1, #new_token_ratio: 0.9998 -> 1.0000`, you can increase `--schedule-conservativeness` to a value like 1.3.
If you see `KV cache pool is full. Retract requests.` occasionally but not frequently, it is okay. If you see `KV cache pool is full. Retract requests.` occasionally but not frequently, it is okay.
### Tune `--mem-fraction-static` to increase the KV cache pool capacity ### Tune `--mem-fraction-static` to increase KV cache pool capacity
GPU memory capacity = model weights + KV cache pool + activations + CUDA graph buffers SGLang allocates memory as follows:
mem_fraction_static = (model weights + KV cache pool) / GPU memory capacity. Total memory usage = model weights + KV cache pool + CUDA graph buffers + activations
We want to increase the KV cache pool capacity to support a larger concurrency, so The `--mem-fraction-static` parameter determines how much memory is allocated to the first two components:
we want `--mem-fraction-static` to be as large as possible but still have enough room
for activations and CUDA graph buffers.
A simple strategy is to increase `--mem-fraction-static` by 0.01 each time until you encounter out-of-memory errors. mem_fraction_static = (model weights + KV cache pool) / GPU memory capacity
To support higher concurrency, you should maximize the KV cache pool capacity by setting `--mem-fraction-static` as high as possible while still reserving enough memory for activations and CUDA graph buffers.
SGLang uses simple heuristics to set the default value of `--mem-fraction-static`, but you can optimize it for your use cases.
As a rule of thumb, reserving 5–8 GB of memory for activations is typically sufficient. You can check this by inspecting the logs just before the server is ready.
Look for log entries like this:
```
[2025-08-11 17:17:03] max_total_num_tokens=665690, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=4096, context_len=65536, available_gpu_mem=13.50 GB
```
Check the `available_gpu_mem` value.
- If it is between 5–8 GB, the setting is good.
- If it is too high (e.g., 10 - 20 GB), increase `--mem-fraction-static` to allocate more memory to the KV cache.
- If it is too low, you risk out-of-memory (OOM) errors later, so decrease `--mem-fraction-static`.
Another straightforward approach is to increase `--mem-fraction-static` in increments of 0.01 until you encounter OOM errors for your workloads.
### Avoid out-of-memory errors by tuning `--chunked-prefill-size`, `--mem-fraction-static`, and `--max-running-requests` ### Avoid out-of-memory errors by tuning `--chunked-prefill-size`, `--mem-fraction-static`, and `--max-running-requests`
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment