Commit e269be2b authored by Cyrus Leung, committed by GitHub

[Doc] Add caution for API server scale-out (#23550)


Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
parent 5c4b6e66
@@ -196,6 +196,13 @@ vllm serve Qwen/Qwen2.5-VL-3B-Instruct --api-server-count 4 -dp 2
!!! note
API server scale-out is only available for online inference.
!!! warning
By default, 8 CPU threads are used in each API server to load media items (e.g. images)
from request data.
If you apply API server scale-out, consider adjusting `VLLM_MEDIA_LOADING_THREAD_COUNT`
to avoid CPU resource exhaustion.
!!! note
[Multi-modal processor cache](#processor-cache) is disabled when API server scale-out is enabled
because it requires a one-to-one correspondence between API and engine core processes.
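
The warning added in this commit implies that total media-loading threads grow with the number of API servers: with `--api-server-count 4` and the default of 8 threads each, that is up to 32 threads. The sketch below shows one possible way to choose a per-server value for `VLLM_MEDIA_LOADING_THREAD_COUNT`. The helper name `media_threads_per_server` and the even-split policy are assumptions made for illustration; they are not part of vLLM.

```python
import os

# Default number of media-loading threads per API server, as stated in the warning.
DEFAULT_MEDIA_THREADS = 8


def media_threads_per_server(api_server_count: int, cpu_budget: int) -> int:
    """Hypothetical helper: split a CPU thread budget evenly across API servers.

    The result is capped at the default (8) and never drops below 1.
    """
    if api_server_count < 1:
        raise ValueError("api_server_count must be >= 1")
    per_server = cpu_budget // api_server_count
    return max(1, min(DEFAULT_MEDIA_THREADS, per_server))


if __name__ == "__main__":
    # Assumption: the whole machine's CPU count is the budget for media loading.
    cpus = os.cpu_count() or 1
    threads = media_threads_per_server(api_server_count=4, cpu_budget=cpus)
    # Print a line that could be pasted into the shell before `vllm serve`.
    print(f"export VLLM_MEDIA_LOADING_THREAD_COUNT={threads}")
```

In practice the budget would likely be smaller than the full CPU count, since the engine core processes also need CPU time; the right value depends on the workload.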