@@ -25,16 +25,31 @@ On the other hand, if you see `token usage` very high and you frequently see war
...
`KV cache pool is full. Retract requests. #retracted_reqs: 1, #new_token_ratio: 0.9998 -> 1.0000`, you can increase `--schedule-conservativeness` to a value like 1.3.
If you see `KV cache pool is full. Retract requests.` only occasionally, that is fine and no tuning is needed.
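
To judge whether retractions are occasional or frequent, it can help to count the warnings in a captured server log. The following is a minimal sketch, not part of SGLang: the helper name `summarize_retractions` is hypothetical, and the pattern assumes the warning has exactly the format of the sample line quoted above.

```python
import re

# Pattern built from the sample warning quoted above; the field layout
# (#retracted_reqs, #new_token_ratio) is assumed to match that sample.
RETRACT_RE = re.compile(
    r"KV cache pool is full\. Retract requests\. "
    r"#retracted_reqs: (\d+), #new_token_ratio: ([\d.]+) -> ([\d.]+)"
)


def summarize_retractions(log_lines):
    """Return (number of retract warnings, total retracted requests)."""
    warnings = 0
    retracted = 0
    for line in log_lines:
        match = RETRACT_RE.search(line)
        if match:
            warnings += 1
            retracted += int(match.group(1))
    return warnings, retracted


if __name__ == "__main__":
    sample = [
        "KV cache pool is full. Retract requests. "
        "#retracted_reqs: 1, #new_token_ratio: 0.9998 -> 1.0000",
        "some unrelated log line",
    ]
    print(summarize_retractions(sample))
```

What counts as "frequent" depends on your workload; comparing the warning count with the number of requests served over the same period is one reasonable way to decide.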
### Tune `--mem-fraction-static` to increase the KV cache pool capacity
GPU memory capacity = model weights + KV cache pool + activations + CUDA graph buffers
To support higher concurrency, you should maximize the KV cache pool capacity by setting `--mem-fraction-static` as high as possible while still reserving enough memory for activations and CUDA graph buffers.
SGLang uses simple heuristics to set the default value of `--mem-fraction-static`, but you can optimize it for your use cases.
As a rule of thumb, reserving 5–8 GB of memory for activations is typically sufficient. You can check this by inspecting the logs just before the server is ready.
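
The rule of thumb can be turned into a starting value with some simple arithmetic. The sketch below is illustrative only: the function name `suggest_mem_fraction_static` is hypothetical, and it assumes that `--mem-fraction-static` is the fraction of total GPU memory given to model weights plus the KV cache pool, so that whatever remains is the headroom for activations and CUDA graph buffers.

```python
def suggest_mem_fraction_static(gpu_memory_gb, reserved_gb):
    """Fraction of GPU memory left after reserving headroom.

    Assumption: the reserved amount must cover activations and CUDA
    graph buffers, and the rest goes to weights plus the KV cache pool.
    """
    if not 0 < reserved_gb < gpu_memory_gb:
        raise ValueError("reserved_gb must be positive and below gpu_memory_gb")
    return (gpu_memory_gb - reserved_gb) / gpu_memory_gb


if __name__ == "__main__":
    # Example: an 80 GB GPU with the 5-8 GB reservation range from the text.
    for reserved in (5, 8):
        fraction = suggest_mem_fraction_static(80, reserved)
        print(f"reserve {reserved} GB -> --mem-fraction-static {fraction:.3f}")
```

Treat the result as a first guess: start from it, confirm the remaining memory in the startup logs, and lower the value if the server runs out of memory under load.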