[Doc] Add caution for API server scale-out (#23550)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

Commit e269be2b authored Aug 25, 2025 by Cyrus Leung; committed by GitHub Aug 25, 2025

Showing with 7 additions and 0 deletions

docs/configuration/optimization.md +7 -0

--- a/docs/configuration/optimization.md
+++ b/docs/configuration/optimization.md
@@ -196,6 +196,13 @@ vllm serve Qwen/Qwen2.5-VL-3B-Instruct --api-server-count 4 -dp 2
 !!! note
    API server scale-out is only available for online inference.
+!!! warning
+    By default, 8 CPU threads are used in each API server to load media items (e.g. images)
+    from request data.
+    If you apply API server scale-out, consider adjusting `VLLM_MEDIA_LOADING_THREAD_COUNT`
+    to avoid CPU resource exhaustion.
 !!! note
    [Multi-modal processor cache](#processor-cache) is disabled when API server scale-out is enabled
    because it requires a one-to-one correspondence between API and engine core processes.
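
Reviewer note: the new warning names `VLLM_MEDIA_LOADING_THREAD_COUNT` but not how to pick a value. Below is a minimal sketch of one possible sizing heuristic, assuming you want to split a fixed CPU budget evenly across API servers and never exceed the documented default of 8. The helper names (`media_threads_per_server`, `build_env`) and the heuristic itself are hypothetical illustrations, not part of vLLM or a vLLM recommendation.

```python
import os

# Default number of media-loading threads per API server, per the warning above.
DEFAULT_MEDIA_THREADS = 8


def media_threads_per_server(api_server_count: int, cpu_budget: int) -> int:
    """Split a CPU budget evenly across API servers (illustrative heuristic only)."""
    if api_server_count < 1 or cpu_budget < 1:
        raise ValueError("api_server_count and cpu_budget must be positive")
    per_server = cpu_budget // api_server_count
    # Stay at or below the documented default, and keep at least one thread.
    return max(1, min(DEFAULT_MEDIA_THREADS, per_server))


def build_env(api_server_count: int, cpu_budget: int) -> dict[str, str]:
    """Return a copy of the current environment with the thread count set."""
    env = dict(os.environ)
    env["VLLM_MEDIA_LOADING_THREAD_COUNT"] = str(
        media_threads_per_server(api_server_count, cpu_budget)
    )
    return env
```

The resulting mapping could then be passed as `env=` to `subprocess.run` when launching `vllm serve ... --api-server-count 4`, so that each API server process inherits the reduced thread count.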