### Tune Your Request Submission Speed to Control `#queue-req`
`#queue-req` indicates the number of requests in the queue. If you frequently see `#queue-req == 0`, it suggests you are bottlenecked by the request submission speed: your client is not sending requests fast enough to keep the server busy.
A healthy range for `#queue-req` is `50 - 500`.
On the other hand, do not make `#queue-req` too large, because it also increases the scheduling overhead on the server, especially when using the default longest-prefix-match schedule policy (`--schedule-policy lpm`).
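If your client sends requests one at a time and waits for each response, the queue drains and `#queue-req` stays near zero. Below is a minimal sketch of concurrent submission; it assumes a server at `http://localhost:30000` exposing the native `/generate` endpoint, and the prompts, sampling parameters, and worker count are illustrative placeholders.

```python
import requests
from concurrent.futures import ThreadPoolExecutor

# Assumed server address and native /generate endpoint; adjust for your deployment.
URL = "http://localhost:30000/generate"

# Placeholder workload; replace with your real prompts.
prompts = [f"Write a haiku about the number {i}." for i in range(2000)]

def send(prompt: str) -> dict:
    # Each worker keeps one request in flight, so the server queue stays populated.
    resp = requests.post(
        URL,
        json={"text": prompt, "sampling_params": {"max_new_tokens": 64, "temperature": 0}},
    )
    resp.raise_for_status()
    return resp.json()

# Submit many requests concurrently instead of one by one.
with ThreadPoolExecutor(max_workers=256) as pool:
    results = list(pool.map(send, prompts))
```

With a few hundred requests in flight, `#queue-req` should settle into the healthy range above; if it still reads zero, raise the client-side concurrency.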
### Tune `--schedule-conservativeness` to achieve a high `token usage`
`token usage` indicates the KV cache memory utilization of the server. `token usage > 0.9` means good utilization.
If you frequently see `token usage < 0.9` and `#queue-req > 0`, it means the server is too conservative about taking in new requests. You can decrease `--schedule-conservativeness` to a value like 0.3.
The server can become too conservative when users send many requests with a large `max_new_tokens` but the requests stop very early due to EOS tokens or stop strings.
...
...
On the other hand, if you see `token usage` very high and you frequently see warnings like
`KV cache pool is full. Retract requests. #retracted_reqs: 1, #new_token_ratio: 0.9998 -> 1.0000`, you can increase `--schedule-conservativeness` to a value like 1.3.
If you see `KV cache pool is full. Retract requests.` occasionally but not frequently, it is okay.
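As a sketch of how the flag is passed at launch (the launch module and model path are assumptions used only for illustration):

```python
import subprocess

# Illustrative launch only: the model path is a placeholder.
# Lower --schedule-conservativeness (e.g., 0.3) if token usage stays below 0.9 while
# requests are queued; raise it (e.g., 1.3) if you see frequent KV cache retractions.
subprocess.run([
    "python", "-m", "sglang.launch_server",
    "--model-path", "meta-llama/Llama-3.1-8B-Instruct",
    "--schedule-conservativeness", "0.3",
])
```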
### Tune `--mem-fraction-static` to increase the KV cache pool capacity
GPU memory capacity = model weights + KV cache pool + activations + CUDA graph buffers
We want to increase the KV cache pool capacity to support a larger concurrency, so
we want `--mem-fraction-static` to be as large as possible while still having enough room
for activations and CUDA graph buffers.
A simple strategy is to increase `--mem-fraction-static` by 0.01 each time until you encounter out-of-memory errors.
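To get intuition for the trade-off, here is a rough budget sketch based on the breakdown above; the GPU size, weight footprint, and fraction are illustrative assumptions, not measurements.

```python
# Rough KV cache budget estimate based on the memory breakdown above.
# All inputs are illustrative assumptions, not measured values.
gpu_memory_gb = 80.0          # assumed total GPU memory
weights_gb = 16.0             # assumed model weight footprint per GPU
mem_fraction_static = 0.85    # the flag being tuned

# In this breakdown, the static share covers model weights plus the KV cache pool;
# whatever is left over must hold activations and CUDA graph buffers.
static_budget_gb = gpu_memory_gb * mem_fraction_static
kv_pool_gb = static_budget_gb - weights_gb
headroom_gb = gpu_memory_gb - static_budget_gb

print(f"KV cache pool ≈ {kv_pool_gb:.1f} GB, headroom ≈ {headroom_gb:.1f} GB")
```

Raising the fraction grows the KV cache pool (more concurrent tokens) but shrinks the headroom, which is why the 0.01-step search above stops at the first out-of-memory error.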
## Avoid out-of-memory errors by tuning `--chunked-prefill-size`, `--mem-fraction-static`, and `--max-running-requests`
If you encounter out-of-memory (OOM) errors, you can adjust the following parameters; a combined launch sketch follows the list:
- If OOM occurs during prefill, try reducing `--chunked-prefill-size` to `4096` or `2048`. This saves memory but slows down the prefill speed for long prompts.
- If OOM occurs during decoding, try lowering `--max-running-requests`.
- You can also reduce `--mem-fraction-static` to a smaller value, such as 0.8 or 0.7. This decreases the memory usage of the KV cache memory pool and helps prevent OOM errors during both prefill and decoding. However, it limits maximum concurrency and reduces peak throughput.
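A combined sketch of the adjustments above; the model path is a placeholder and the `--max-running-requests` value is an assumed starting point, not a recommendation for every model or GPU.

```python
import subprocess

# Illustrative OOM-mitigation launch; tighten or loosen each value for your workload.
subprocess.run([
    "python", "-m", "sglang.launch_server",
    "--model-path", "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    "--chunked-prefill-size", "2048",    # smaller prefill chunks if prefill OOMs
    "--max-running-requests", "128",     # assumed cap on concurrent decodes
    "--mem-fraction-static", "0.8",      # shrink the KV cache pool as a last resort
])
```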
### Tune `--cuda-graph-max-bs`
By default, CUDA graph is enabled only for small batch sizes (e.g., less than 160 or 256).
However, for some models, especially at large tensor parallelism sizes, CUDA graph can be useful for batch sizes up to 512 or 768.
Therefore, it may be beneficial to increase `--cuda-graph-max-bs` to a larger value.
Note that CUDA graph consumes more memory, so you may need to reduce `--mem-fraction-static` at the same time.
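For example, a sketch that raises the CUDA graph batch-size ceiling while giving back some static memory; the values and model path are illustrative assumptions.

```python
import subprocess

# Illustrative launch: capture CUDA graphs up to batch size 512 and reduce the
# static memory fraction to leave room for the extra graph buffers noted above.
subprocess.run([
    "python", "-m", "sglang.launch_server",
    "--model-path", "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    "--cuda-graph-max-bs", "512",
    "--mem-fraction-static", "0.8",
])
```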
## Enabling cache for `torch.compile`
To enable `torch.compile` acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. By default, `torch.compile` automatically caches the FX graph and Triton artifacts in `/tmp/torchinductor_root`, which might be cleared according to the [system policy](https://serverfault.com/questions/377348/when-does-tmp-get-cleared). You can export the environment variable `TORCHINDUCTOR_CACHE_DIR` to save the compilation cache in a directory of your choice and avoid unwanted deletion. You can also share the cache with other machines to reduce the compilation time.
SGLang uses the `max-autotune-no-cudagraphs` mode of `torch.compile`. The auto-tuning can be slow.
If you want to deploy a model on many different machines, you can ship the `torch.compile` cache to these machines and skip the compilation steps. This is based on the [PyTorch official documentation](https://pytorch.org/tutorials/recipes/torch_compile_caching_tutorial.html).
*Examples*:
1. Generate the cache by setting `TORCHINDUCTOR_CACHE_DIR` and running the model once.
2. Copy the cache folder to the other machines and launch the server with the same `TORCHINDUCTOR_CACHE_DIR`.
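A minimal sketch of step 1 above, assuming a durable, shareable cache directory of your choice; the path and model are placeholders.

```python
import os
import subprocess

# Step 1: pin the Inductor cache to a durable directory instead of /tmp, then run
# the server once with torch.compile enabled to populate the cache.
env = os.environ.copy()
env["TORCHINDUCTOR_CACHE_DIR"] = "/data/torch_compile_cache"  # placeholder path

subprocess.run(
    [
        "python", "-m", "sglang.launch_server",
        "--model-path", "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
        "--enable-torch-compile",
    ],
    env=env,
)
```

For step 2, copy the same directory to the other machines and export `TORCHINDUCTOR_CACHE_DIR` there before launching.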
### Tune `--dp-size` and `--tp-size`
Data parallelism is better for throughput: when there is enough GPU memory, always favor data parallelism. Refer to the [sglang router](../router/router.md) for a better way to run data parallelism than the `dp_size` parameter.
## Tune `--schedule-policy`
If the workload has many shared prefixes, use the default `--schedule-policy lpm`, where `lpm` stands for longest prefix match. It reorders requests to encourage more prefix cache hits, but introduces extra scheduling overhead.
When you have no shared prefixes at all, or you always send the requests with shared prefixes together,
you can try `--schedule-policy fcfs`, where `fcfs` stands for first come first serve. This policy has a lower scheduling overhead.
### Try other options
- Enable `torch.compile` with `--enable-torch-compile`; it accelerates small models on small batch sizes (see the `torch.compile` section above).
- Try other quantization methods (e.g., FP8 quantization) or other parallelism strategies (e.g., expert parallelism).
- For workloads with many shared prefixes, see the `--schedule-policy` section above.
The following sections list common errors and tips for resolving them.
## CUDA Out of Memory
If you encounter out-of-memory (OOM) errors, you can adjust the following parameters:
- If OOM occurs during prefill, try reducing `--chunked-prefill-size` to `4096` or `2048`. This saves memory but slows down the prefill speed for long prompts.
- If OOM occurs during decoding, try lowering `--max-running-requests`.
- You can also reduce `--mem-fraction-static` to a smaller value, such as 0.8 or 0.7. This decreases the memory usage of the KV cache memory pool and helps prevent OOM errors during both prefill and decoding. However, it limits maximum concurrency and reduces peak throughput.
- Another common cause of OOM is requesting input logprobs for a long prompt, as this requires significant memory. To address it, set `logprob_start_len` in your sampling parameters so that logprobs are computed only for the necessary part of the prompt, as in the sketch below. If you do need input logprobs for the full long prompt, try reducing `--mem-fraction-static`.
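Below is a minimal sketch of such a request against the native `/generate` endpoint. The server address, prompt, and start index are placeholders, and depending on your SGLang version the logprob fields may be top-level request fields (as written here) rather than nested sampling parameters, so check the API reference for your release.

```python
import requests

long_prompt = "..."  # placeholder for a very long prompt

# Request input logprobs only from token index 1000 onward (assumed value) instead of
# the whole prompt, which bounds the extra memory needed on the server.
response = requests.post(
    "http://localhost:30000/generate",  # assumed server address
    json={
        "text": long_prompt,
        "sampling_params": {"max_new_tokens": 1, "temperature": 0},
        "return_logprob": True,
        "logprob_start_len": 1000,
    },
)
print(response.json())
```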
## CUDA Error: Illegal Memory Access Encountered
This error may result from kernel errors or out-of-memory issues:
- If it is a kernel error, resolving it may be challenging. Please file an issue on GitHub.
- If it is an out-of-memory issue, it may sometimes be reported as this error instead of "Out of Memory." Refer to the section above for guidance on avoiding OOM issues.