@@ -26,9 +26,9 @@ Data parallelism is better for throughput. When there is enough GPU memory, alwa
...
@@ -26,9 +26,9 @@ Data parallelism is better for throughput. When there is enough GPU memory, alwa
### Avoid out-of-memory by Tuning `--chunked-prefill-size`, `--mem-fraction-static`, `--max-running-requests`
### Avoid out-of-memory by Tuning `--chunked-prefill-size`, `--mem-fraction-static`, `--max-running-requests`
If you see out of memory (OOM) errors, you can try to tune the following parameters.
If you see out of memory (OOM) errors, you can try to tune the following parameters.
If OOM happens during prefill, try to decrease `--chunked-prefill-size` to `4096` or `2048`.
- If OOM happens during prefill, try to decrease `--chunked-prefill-size` to `4096` or `2048`.
If OOM happens during decoding, try to decrease `--max-running-requests`.
- If OOM happens during decoding, try to decrease `--max-running-requests`.
You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
- You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
### Try Advanced Options
### Try Advanced Options
- To enable the experimental overlapped scheduler, add `--enable-overlap-scheduler`. It overlaps CPU scheduler with GPU computation and can accelerate almost all workloads. This does not work for constrained decoding currenly.
- To enable the experimental overlapped scheduler, add `--enable-overlap-scheduler`. It overlaps CPU scheduler with GPU computation and can accelerate almost all workloads. This does not work for constrained decoding currenly.