Unverified Commit dd6a3a02 authored by Harry Mellor's avatar Harry Mellor Committed by GitHub
Browse files

[Doc] Convert docs to use colon fences (#12471)


Signed-off-by: default avatarHarry Mellor <19981378+hmellor@users.noreply.github.com>
parent a7e3eba6
...@@ -4,6 +4,7 @@ ...@@ -4,6 +4,7 @@
Below, you can find an explanation of every engine argument for vLLM: Below, you can find an explanation of every engine argument for vLLM:
<!--- pyml disable-num-lines 7 no-space-in-emphasis-->
```{eval-rst} ```{eval-rst}
.. argparse:: .. argparse::
:module: vllm.engine.arg_utils :module: vllm.engine.arg_utils
...@@ -16,6 +17,7 @@ Below, you can find an explanation of every engine argument for vLLM: ...@@ -16,6 +17,7 @@ Below, you can find an explanation of every engine argument for vLLM:
Below are the additional arguments related to the asynchronous engine: Below are the additional arguments related to the asynchronous engine:
<!--- pyml disable-num-lines 7 no-space-in-emphasis-->
```{eval-rst} ```{eval-rst}
.. argparse:: .. argparse::
:module: vllm.engine.arg_utils :module: vllm.engine.arg_utils
......
...@@ -2,14 +2,14 @@ ...@@ -2,14 +2,14 @@
vLLM uses the following environment variables to configure the system: vLLM uses the following environment variables to configure the system:
```{warning} :::{warning}
Please note that `VLLM_PORT` and `VLLM_HOST_IP` set the port and ip for vLLM's **internal usage**. It is not the port and ip for the API server. If you use `--host $VLLM_HOST_IP` and `--port $VLLM_PORT` to start the API server, it will not work. Please note that `VLLM_PORT` and `VLLM_HOST_IP` set the port and ip for vLLM's **internal usage**. It is not the port and ip for the API server. If you use `--host $VLLM_HOST_IP` and `--port $VLLM_PORT` to start the API server, it will not work.
All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables). All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).
``` :::
```{literalinclude} ../../../vllm/envs.py :::{literalinclude} ../../../vllm/envs.py
:end-before: end-env-vars-definition :end-before: end-env-vars-definition
:language: python :language: python
:start-after: begin-env-vars-definition :start-after: begin-env-vars-definition
``` :::
# External Integrations # External Integrations
```{toctree} :::{toctree}
:maxdepth: 1 :maxdepth: 1
langchain langchain
llamaindex llamaindex
``` :::
...@@ -31,8 +31,8 @@ vllm:iteration_tokens_total_bucket{le="512.0",model_name="unsloth/Llama-3.2-1B-I ...@@ -31,8 +31,8 @@ vllm:iteration_tokens_total_bucket{le="512.0",model_name="unsloth/Llama-3.2-1B-I
The following metrics are exposed: The following metrics are exposed:
```{literalinclude} ../../../vllm/engine/metrics.py :::{literalinclude} ../../../vllm/engine/metrics.py
:end-before: end-metrics-definitions :end-before: end-metrics-definitions
:language: python :language: python
:start-after: begin-metrics-definitions :start-after: begin-metrics-definitions
``` :::
...@@ -4,10 +4,10 @@ ...@@ -4,10 +4,10 @@
This page teaches you how to pass multi-modal inputs to [multi-modal models](#supported-mm-models) in vLLM. This page teaches you how to pass multi-modal inputs to [multi-modal models](#supported-mm-models) in vLLM.
```{note} :::{note}
We are actively iterating on multi-modal support. See [this RFC](gh-issue:4194) for upcoming changes, We are actively iterating on multi-modal support. See [this RFC](gh-issue:4194) for upcoming changes,
and [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) if you have any feedback or feature requests. and [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) if you have any feedback or feature requests.
``` :::
## Offline Inference ## Offline Inference
...@@ -203,13 +203,13 @@ for o in outputs: ...@@ -203,13 +203,13 @@ for o in outputs:
Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions API](https://platform.openai.com/docs/api-reference/chat). Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions API](https://platform.openai.com/docs/api-reference/chat).
```{important} :::{important}
A chat template is **required** to use Chat Completions API. A chat template is **required** to use Chat Completions API.
Although most models come with a chat template, for others you have to define one yourself. Although most models come with a chat template, for others you have to define one yourself.
The chat template can be inferred based on the documentation on the model's HuggingFace repo. The chat template can be inferred based on the documentation on the model's HuggingFace repo.
For example, LLaVA-1.5 (`llava-hf/llava-1.5-7b-hf`) requires a chat template that can be found here: <gh-file:examples/template_llava.jinja> For example, LLaVA-1.5 (`llava-hf/llava-1.5-7b-hf`) requires a chat template that can be found here: <gh-file:examples/template_llava.jinja>
``` :::
### Image ### Image
...@@ -273,24 +273,25 @@ print("Chat completion output:", chat_response.choices[0].message.content) ...@@ -273,24 +273,25 @@ print("Chat completion output:", chat_response.choices[0].message.content)
Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py> Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>
```{tip} :::{tip}
Loading from local file paths is also supported on vLLM: You can specify the allowed local media path via `--allowed-local-media-path` when launching the API server/engine, Loading from local file paths is also supported on vLLM: You can specify the allowed local media path via `--allowed-local-media-path` when launching the API server/engine,
and pass the file path as `url` in the API request. and pass the file path as `url` in the API request.
``` :::
```{tip} :::{tip}
There is no need to place image placeholders in the text content of the API request - they are already represented by the image content. There is no need to place image placeholders in the text content of the API request - they are already represented by the image content.
In fact, you can place image placeholders in the middle of the text by interleaving text and image content. In fact, you can place image placeholders in the middle of the text by interleaving text and image content.
``` :::
````{note} :::{note}
By default, the timeout for fetching images through HTTP URL is `5` seconds. By default, the timeout for fetching images through HTTP URL is `5` seconds.
You can override this by setting the environment variable: You can override this by setting the environment variable:
```console ```console
$ export VLLM_IMAGE_FETCH_TIMEOUT=<timeout> export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>
``` ```
````
:::
### Video ### Video
...@@ -345,14 +346,15 @@ print("Chat completion output from image url:", result) ...@@ -345,14 +346,15 @@ print("Chat completion output from image url:", result)
Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py> Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>
````{note} :::{note}
By default, the timeout for fetching videos through HTTP URL is `30` seconds. By default, the timeout for fetching videos through HTTP URL is `30` seconds.
You can override this by setting the environment variable: You can override this by setting the environment variable:
```console ```console
$ export VLLM_VIDEO_FETCH_TIMEOUT=<timeout> export VLLM_VIDEO_FETCH_TIMEOUT=<timeout>
``` ```
````
:::
### Audio ### Audio
...@@ -448,24 +450,25 @@ print("Chat completion output from audio url:", result) ...@@ -448,24 +450,25 @@ print("Chat completion output from audio url:", result)
Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py> Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>
````{note} :::{note}
By default, the timeout for fetching audios through HTTP URL is `10` seconds. By default, the timeout for fetching audios through HTTP URL is `10` seconds.
You can override this by setting the environment variable: You can override this by setting the environment variable:
```console ```console
$ export VLLM_AUDIO_FETCH_TIMEOUT=<timeout> export VLLM_AUDIO_FETCH_TIMEOUT=<timeout>
``` ```
````
:::
### Embedding ### Embedding
vLLM's Embeddings API is a superset of OpenAI's [Embeddings API](https://platform.openai.com/docs/api-reference/embeddings), vLLM's Embeddings API is a superset of OpenAI's [Embeddings API](https://platform.openai.com/docs/api-reference/embeddings),
where a list of chat `messages` can be passed instead of batched `inputs`. This enables multi-modal inputs to be passed to embedding models. where a list of chat `messages` can be passed instead of batched `inputs`. This enables multi-modal inputs to be passed to embedding models.
```{tip} :::{tip}
The schema of `messages` is exactly the same as in Chat Completions API. The schema of `messages` is exactly the same as in Chat Completions API.
You can refer to the above tutorials for more details on how to pass each type of multi-modal data. You can refer to the above tutorials for more details on how to pass each type of multi-modal data.
``` :::
Usually, embedding models do not expect chat-based input, so we need to use a custom chat template to format the text and images. Usually, embedding models do not expect chat-based input, so we need to use a custom chat template to format the text and images.
Refer to the examples below for illustration. Refer to the examples below for illustration.
...@@ -477,13 +480,13 @@ vllm serve TIGER-Lab/VLM2Vec-Full --task embed \ ...@@ -477,13 +480,13 @@ vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
--trust-remote-code --max-model-len 4096 --chat-template examples/template_vlm2vec.jinja --trust-remote-code --max-model-len 4096 --chat-template examples/template_vlm2vec.jinja
``` ```
```{important} :::{important}
Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--task embed` Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--task embed`
to run this model in embedding mode instead of text generation mode. to run this model in embedding mode instead of text generation mode.
The custom chat template is completely different from the original one for this model, The custom chat template is completely different from the original one for this model,
and can be found here: <gh-file:examples/template_vlm2vec.jinja> and can be found here: <gh-file:examples/template_vlm2vec.jinja>
``` :::
Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level `requests` library: Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level `requests` library:
...@@ -518,16 +521,16 @@ vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \ ...@@ -518,16 +521,16 @@ vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
--trust-remote-code --max-model-len 8192 --chat-template examples/template_dse_qwen2_vl.jinja --trust-remote-code --max-model-len 8192 --chat-template examples/template_dse_qwen2_vl.jinja
``` ```
```{important} :::{important}
Like with VLM2Vec, we have to explicitly pass `--task embed`. Like with VLM2Vec, we have to explicitly pass `--task embed`.
Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
by a custom chat template: <gh-file:examples/template_dse_qwen2_vl.jinja> by a custom chat template: <gh-file:examples/template_dse_qwen2_vl.jinja>
``` :::
```{important} :::{important}
Also important, `MrLight/dse-qwen2-2b-mrl-v1` requires a placeholder image of the minimum image size for text query embeddings. See the full code Also important, `MrLight/dse-qwen2-2b-mrl-v1` requires a placeholder image of the minimum image size for text query embeddings. See the full code
example below for details. example below for details.
``` :::
Full example: <gh-file:examples/online_serving/openai_chat_embedding_client_for_multimodal.py> Full example: <gh-file:examples/online_serving/openai_chat_embedding_client_for_multimodal.py>
...@@ -22,9 +22,9 @@ The available APIs depend on the type of model that is being run: ...@@ -22,9 +22,9 @@ The available APIs depend on the type of model that is being run:
Please refer to the above pages for more details about each API. Please refer to the above pages for more details about each API.
```{seealso} :::{seealso}
[API Reference](/api/offline_inference/index) [API Reference](/api/offline_inference/index)
``` :::
## Configuration Options ## Configuration Options
...@@ -70,12 +70,12 @@ llm = LLM(model="ibm-granite/granite-3.1-8b-instruct", ...@@ -70,12 +70,12 @@ llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
tensor_parallel_size=2) tensor_parallel_size=2)
``` ```
```{important} :::{important}
To ensure that vLLM initializes CUDA correctly, you should avoid calling related functions (e.g. {func}`torch.cuda.set_device`) To ensure that vLLM initializes CUDA correctly, you should avoid calling related functions (e.g. {func}`torch.cuda.set_device`)
before initializing vLLM. Otherwise, you may run into an error like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`. before initializing vLLM. Otherwise, you may run into an error like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`.
To control which devices are used, please instead set the `CUDA_VISIBLE_DEVICES` environment variable. To control which devices are used, please instead set the `CUDA_VISIBLE_DEVICES` environment variable.
``` :::
#### Quantization #### Quantization
......
...@@ -161,11 +161,11 @@ print(completion._request_id) ...@@ -161,11 +161,11 @@ print(completion._request_id)
The `vllm serve` command is used to launch the OpenAI-compatible server. The `vllm serve` command is used to launch the OpenAI-compatible server.
```{argparse} :::{argparse}
:module: vllm.entrypoints.openai.cli_args :module: vllm.entrypoints.openai.cli_args
:func: create_parser_for_docs :func: create_parser_for_docs
:prog: vllm serve :prog: vllm serve
``` :::
#### Configuration file #### Configuration file
...@@ -188,10 +188,10 @@ To use the above config file: ...@@ -188,10 +188,10 @@ To use the above config file:
vllm serve SOME_MODEL --config config.yaml vllm serve SOME_MODEL --config config.yaml
``` ```
```{note} :::{note}
In case an argument is supplied simultaneously using command line and the config file, the value from the command line will take precedence. In case an argument is supplied simultaneously using command line and the config file, the value from the command line will take precedence.
The order of priorities is `command line > config file values > defaults`. The order of priorities is `command line > config file values > defaults`.
``` :::
## API Reference ## API Reference
...@@ -208,19 +208,19 @@ Code example: <gh-file:examples/online_serving/openai_completion_client.py> ...@@ -208,19 +208,19 @@ Code example: <gh-file:examples/online_serving/openai_completion_client.py>
The following [sampling parameters](#sampling-params) are supported. The following [sampling parameters](#sampling-params) are supported.
```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py :::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python :language: python
:start-after: begin-completion-sampling-params :start-after: begin-completion-sampling-params
:end-before: end-completion-sampling-params :end-before: end-completion-sampling-params
``` :::
The following extra parameters are supported: The following extra parameters are supported:
```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py :::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python :language: python
:start-after: begin-completion-extra-params :start-after: begin-completion-extra-params
:end-before: end-completion-extra-params :end-before: end-completion-extra-params
``` :::
(chat-api)= (chat-api)=
...@@ -240,19 +240,19 @@ Code example: <gh-file:examples/online_serving/openai_chat_completion_client.py> ...@@ -240,19 +240,19 @@ Code example: <gh-file:examples/online_serving/openai_chat_completion_client.py>
The following [sampling parameters](#sampling-params) are supported. The following [sampling parameters](#sampling-params) are supported.
```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py :::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python :language: python
:start-after: begin-chat-completion-sampling-params :start-after: begin-chat-completion-sampling-params
:end-before: end-chat-completion-sampling-params :end-before: end-chat-completion-sampling-params
``` :::
The following extra parameters are supported: The following extra parameters are supported:
```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py :::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python :language: python
:start-after: begin-chat-completion-extra-params :start-after: begin-chat-completion-extra-params
:end-before: end-chat-completion-extra-params :end-before: end-chat-completion-extra-params
``` :::
(embeddings-api)= (embeddings-api)=
...@@ -264,9 +264,9 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai ...@@ -264,9 +264,9 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai
If the model has a [chat template](#chat-template), you can replace `inputs` with a list of `messages` (same schema as [Chat API](#chat-api)) If the model has a [chat template](#chat-template), you can replace `inputs` with a list of `messages` (same schema as [Chat API](#chat-api))
which will be treated as a single prompt to the model. which will be treated as a single prompt to the model.
```{tip} :::{tip}
This enables multi-modal inputs to be passed to embedding models, see [this page](#multimodal-inputs) for details. This enables multi-modal inputs to be passed to embedding models, see [this page](#multimodal-inputs) for details.
``` :::
Code example: <gh-file:examples/online_serving/openai_embedding_client.py> Code example: <gh-file:examples/online_serving/openai_embedding_client.py>
...@@ -274,27 +274,27 @@ Code example: <gh-file:examples/online_serving/openai_embedding_client.py> ...@@ -274,27 +274,27 @@ Code example: <gh-file:examples/online_serving/openai_embedding_client.py>
The following [pooling parameters](#pooling-params) are supported. The following [pooling parameters](#pooling-params) are supported.
```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py :::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python :language: python
:start-after: begin-embedding-pooling-params :start-after: begin-embedding-pooling-params
:end-before: end-embedding-pooling-params :end-before: end-embedding-pooling-params
``` :::
The following extra parameters are supported by default: The following extra parameters are supported by default:
```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py :::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python :language: python
:start-after: begin-embedding-extra-params :start-after: begin-embedding-extra-params
:end-before: end-embedding-extra-params :end-before: end-embedding-extra-params
``` :::
For chat-like input (i.e. if `messages` is passed), these extra parameters are supported instead: For chat-like input (i.e. if `messages` is passed), these extra parameters are supported instead:
```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py :::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python :language: python
:start-after: begin-chat-embedding-extra-params :start-after: begin-chat-embedding-extra-params
:end-before: end-chat-embedding-extra-params :end-before: end-chat-embedding-extra-params
``` :::
(tokenizer-api)= (tokenizer-api)=
...@@ -465,19 +465,19 @@ Response: ...@@ -465,19 +465,19 @@ Response:
The following [pooling parameters](#pooling-params) are supported. The following [pooling parameters](#pooling-params) are supported.
```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py :::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python :language: python
:start-after: begin-score-pooling-params :start-after: begin-score-pooling-params
:end-before: end-score-pooling-params :end-before: end-score-pooling-params
``` :::
The following extra parameters are supported: The following extra parameters are supported:
```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py :::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python :language: python
:start-after: begin-score-extra-params :start-after: begin-score-extra-params
:end-before: end-score-extra-params :end-before: end-score-extra-params
``` :::
(rerank-api)= (rerank-api)=
...@@ -552,16 +552,16 @@ Response: ...@@ -552,16 +552,16 @@ Response:
The following [pooling parameters](#pooling-params) are supported. The following [pooling parameters](#pooling-params) are supported.
```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py :::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python :language: python
:start-after: begin-rerank-pooling-params :start-after: begin-rerank-pooling-params
:end-before: end-rerank-pooling-params :end-before: end-rerank-pooling-params
``` :::
The following extra parameters are supported: The following extra parameters are supported:
```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py :::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python :language: python
:start-after: begin-rerank-extra-params :start-after: begin-rerank-extra-params
:end-before: end-rerank-extra-params :end-before: end-rerank-extra-params
``` :::
...@@ -111,6 +111,7 @@ markers = [ ...@@ -111,6 +111,7 @@ markers = [
] ]
[tool.pymarkdown] [tool.pymarkdown]
plugins.md004.style = "sublist" # ul-style
plugins.md013.enabled = false # line-length plugins.md013.enabled = false # line-length
plugins.md041.enabled = false # first-line-h1 plugins.md041.enabled = false # first-line-h1
plugins.md033.enabled = false # inline-html plugins.md033.enabled = false # inline-html
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment