[Doc] Add more tips to avoid OOM (#16765)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

Commit 61a44a0b (unverified), authored Apr 17, 2025 by Cyrus Leung, committed by GitHub Apr 17, 2025
Showing with 33 additions and 0 deletions

docs/source/serving/offline_inference.md +25 -0
docs/source/serving/openai_compatible_server.md +8 -0

--- a/docs/source/serving/offline_inference.md
+++ b/docs/source/serving/offline_inference.md
@@ -28,6 +28,8 @@ Please refer to the above pages for more details about each API.
 [API Reference](/api/offline_inference/index)
 :::

+(configuration-options)=
+
 ## Configuration Options

 This section lists the most common options for running the vLLM engine.
@@ -184,6 +186,29 @@ llm = LLM(model="google/gemma-3-27b-it",
          limit_mm_per_prompt={"image": 0})
 ```

+#### Multi-modal processor arguments
+
+For certain models, you can adjust the multi-modal processor arguments to
+reduce the size of the processed multi-modal inputs, which in turn saves memory.
+
+Here are some examples:
+
+```python
+from vllm import LLM
+
+# Available for Qwen2-VL series models
+llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
+          mm_processor_kwargs={
+              "max_pixels": 768 * 768,  # Default is 1280 * 28 * 28
+          })
+
+# Available for InternVL series models
+llm = LLM(model="OpenGVLab/InternVL2-2B",
+          mm_processor_kwargs={
+              "max_dynamic_patch": 4,  # Default is 12
+          })
+```
+
 ### Performance optimization and tuning

 You can potentially improve the performance of vLLM by finetuning various options.

--- a/docs/source/serving/openai_compatible_server.md
+++ b/docs/source/serving/openai_compatible_server.md
@@ -33,11 +33,13 @@ print(completion.choices[0].message)
 vLLM supports some parameters that are not supported by OpenAI, `top_k` for example.
 You can pass these parameters to vLLM using the OpenAI client in the `extra_body` parameter of your requests, i.e. `extra_body={"top_k": 50}` for `top_k`.
 :::
+
 :::{important}
 By default, the server applies `generation_config.json` from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.

 To disable this behavior, please pass `--generation-config vllm` when launching the server.
 :::
+
 ## Supported APIs

 We currently support the following OpenAI APIs:
@@ -172,6 +174,12 @@ print(completion._request_id)

 The `vllm serve` command is used to launch the OpenAI-compatible server.

+:::{tip}
+The vast majority of command-line arguments are based on those for offline inference.
+
+See [here](configuration-options) for some common options.
+:::
+
 :::{argparse}
 :module: vllm.entrypoints.openai.cli_args
 :func: create_parser_for_docs
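
As a rough illustration of the `max_pixels` tip added in `offline_inference.md`, the sketch below compares the two pixel budgets quoted in that example (`768 * 768` against the stated default of `1280 * 28 * 28`). The helper name `pixel_budget_ratio` is hypothetical, and the figure of one visual token per 28x28 pixel block is an assumption about Qwen2-VL style processors that this commit does not state. The result is only an upper-bound estimate for large images, not a measurement of actual memory use.

```python
# Pixel budgets copied from the comments in the Qwen2.5-VL example in the diff.
DEFAULT_MAX_PIXELS = 1280 * 28 * 28  # default quoted in the diff
REDUCED_MAX_PIXELS = 768 * 768       # value passed via mm_processor_kwargs

# Assumption (not stated in the diff): roughly one visual token
# per 28x28 block of pixels after processing.
PIXELS_PER_TOKEN = 28 * 28


def pixel_budget_ratio(reduced: int, default: int) -> float:
    """Return the reduced pixel budget as a fraction of the default."""
    return reduced / default


ratio = pixel_budget_ratio(REDUCED_MAX_PIXELS, DEFAULT_MAX_PIXELS)
default_tokens = DEFAULT_MAX_PIXELS // PIXELS_PER_TOKEN
reduced_tokens = REDUCED_MAX_PIXELS // PIXELS_PER_TOKEN

print(f"pixel budget: {REDUCED_MAX_PIXELS} of {DEFAULT_MAX_PIXELS} ({ratio:.0%})")
print(f"approx. max visual tokens per image: {reduced_tokens} vs {default_tokens}")
```

Under these assumptions, the reduced setting caps each image at a little under 60% of the default pixel budget, which is where the memory saving described in the tip would come from.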