Unverified Commit 29a38f03 authored by Cyrus Leung's avatar Cyrus Leung Committed by GitHub
Browse files

[Doc] Support "important" and "announcement" admonitions (#19479)


Signed-off-by: default avatarDarkLight1337 <tlleungac@connect.ust.hk>
parent a5115f4f
...@@ -130,7 +130,7 @@ pytest -s -v tests/test_logger.py ...@@ -130,7 +130,7 @@ pytest -s -v tests/test_logger.py
If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible. If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
!!! warning !!! important
If you discover a security vulnerability, please follow the instructions [here](gh-file:SECURITY.md#reporting-a-vulnerability). If you discover a security vulnerability, please follow the instructions [here](gh-file:SECURITY.md#reporting-a-vulnerability).
## Pull Requests & Code Reviews ## Pull Requests & Code Reviews
......
...@@ -48,8 +48,8 @@ Further update the model as follows: ...@@ -48,8 +48,8 @@ Further update the model as follows:
return vision_embeddings return vision_embeddings
``` ```
!!! warning !!! important
The returned `multimodal_embeddings` must be either a **3D [torch.Tensor][]** of shape `(num_items, feature_size, hidden_size)`, or a **list / tuple of 2D [torch.Tensor][]'s** of shape `(feature_size, hidden_size)`, so that `multimodal_embeddings[i]` retrieves the embeddings generated from the `i`-th multimodal data item (e.g, image) of the request. The returned `multimodal_embeddings` must be either a **3D [torch.Tensor][]** of shape `(num_items, feature_size, hidden_size)`, or a **list / tuple of 2D [torch.Tensor][]'s** of shape `(feature_size, hidden_size)`, so that `multimodal_embeddings[i]` retrieves the embeddings generated from the `i`-th multimodal data item (e.g, image) of the request.
- Implement [get_input_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_input_embeddings] to merge `multimodal_embeddings` with text embeddings from the `input_ids`. If input processing for the model is implemented correctly (see sections below), then you can leverage the utility function we provide to easily merge the embeddings. - Implement [get_input_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_input_embeddings] to merge `multimodal_embeddings` with text embeddings from the `input_ids`. If input processing for the model is implemented correctly (see sections below), then you can leverage the utility function we provide to easily merge the embeddings.
...@@ -100,8 +100,8 @@ Further update the model as follows: ...@@ -100,8 +100,8 @@ Further update the model as follows:
``` ```
!!! note !!! note
The model class does not have to be named `*ForCausalLM`. The model class does not have to be named `*ForCausalLM`.
Check out [the HuggingFace Transformers documentation](https://huggingface.co/docs/transformers/model_doc/auto#multimodal) for some examples. Check out [the HuggingFace Transformers documentation](https://huggingface.co/docs/transformers/model_doc/auto#multimodal) for some examples.
## 2. Specify processing information ## 2. Specify processing information
......
...@@ -18,7 +18,7 @@ After you have implemented your model (see [tutorial][new-model-basic]), put it ...@@ -18,7 +18,7 @@ After you have implemented your model (see [tutorial][new-model-basic]), put it
Then, add your model class to `_VLLM_MODELS` in <gh-file:vllm/model_executor/models/registry.py> so that it is automatically registered upon importing vLLM. Then, add your model class to `_VLLM_MODELS` in <gh-file:vllm/model_executor/models/registry.py> so that it is automatically registered upon importing vLLM.
Finally, update our [list of supported models][supported-models] to promote your model! Finally, update our [list of supported models][supported-models] to promote your model!
!!! warning !!! important
The list of models in each section should be maintained in alphabetical order. The list of models in each section should be maintained in alphabetical order.
## Out-of-tree models ## Out-of-tree models
...@@ -49,6 +49,6 @@ def register(): ...@@ -49,6 +49,6 @@ def register():
) )
``` ```
!!! warning !!! important
If your model is a multimodal model, ensure the model class implements the [SupportsMultiModal][vllm.model_executor.models.interfaces.SupportsMultiModal] interface. If your model is a multimodal model, ensure the model class implements the [SupportsMultiModal][vllm.model_executor.models.interfaces.SupportsMultiModal] interface.
Read more about that [here][supports-multimodal]. Read more about that [here][supports-multimodal].
...@@ -15,7 +15,7 @@ Without them, the CI for your PR will fail. ...@@ -15,7 +15,7 @@ Without them, the CI for your PR will fail.
Include an example HuggingFace repository for your model in <gh-file:tests/models/registry.py>. Include an example HuggingFace repository for your model in <gh-file:tests/models/registry.py>.
This enables a unit test that loads dummy weights to ensure that the model can be initialized in vLLM. This enables a unit test that loads dummy weights to ensure that the model can be initialized in vLLM.
!!! warning !!! important
The list of models in each section should be maintained in alphabetical order. The list of models in each section should be maintained in alphabetical order.
!!! tip !!! tip
......
...@@ -7,7 +7,7 @@ page for information on known issues and how to solve them. ...@@ -7,7 +7,7 @@ page for information on known issues and how to solve them.
## Introduction ## Introduction
!!! warning !!! important
The source code references are to the state of the code at the time of writing in December, 2024. The source code references are to the state of the code at the time of writing in December, 2024.
The use of Python multiprocessing in vLLM is complicated by: The use of Python multiprocessing in vLLM is complicated by:
......
...@@ -211,7 +211,7 @@ for o in outputs: ...@@ -211,7 +211,7 @@ for o in outputs:
Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions API](https://platform.openai.com/docs/api-reference/chat). Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions API](https://platform.openai.com/docs/api-reference/chat).
!!! warning !!! important
A chat template is **required** to use Chat Completions API. A chat template is **required** to use Chat Completions API.
For HF format models, the default chat template is defined inside `chat_template.json` or `tokenizer_config.json`. For HF format models, the default chat template is defined inside `chat_template.json` or `tokenizer_config.json`.
......
...@@ -61,7 +61,8 @@ from vllm import LLM, SamplingParams ...@@ -61,7 +61,8 @@ from vllm import LLM, SamplingParams
``` ```
The next section defines a list of input prompts and sampling parameters for text generation. The [sampling temperature](https://arxiv.org/html/2402.05201v1) is set to `0.8` and the [nucleus sampling probability](https://en.wikipedia.org/wiki/Top-p_sampling) is set to `0.95`. You can find more information about the sampling parameters [here][sampling-params]. The next section defines a list of input prompts and sampling parameters for text generation. The [sampling temperature](https://arxiv.org/html/2402.05201v1) is set to `0.8` and the [nucleus sampling probability](https://en.wikipedia.org/wiki/Top-p_sampling) is set to `0.95`. You can find more information about the sampling parameters [here][sampling-params].
!!! warning
!!! important
By default, vLLM will use sampling parameters recommended by model creator by applying the `generation_config.json` from the Hugging Face model repository if it exists. In most cases, this will provide you with the best results by default if [SamplingParams][vllm.SamplingParams] is not specified. By default, vLLM will use sampling parameters recommended by model creator by applying the `generation_config.json` from the Hugging Face model repository if it exists. In most cases, this will provide you with the best results by default if [SamplingParams][vllm.SamplingParams] is not specified.
However, if vLLM's default sampling parameters are preferred, please set `generation_config="vllm"` when creating the [LLM][vllm.LLM] instance. However, if vLLM's default sampling parameters are preferred, please set `generation_config="vllm"` when creating the [LLM][vllm.LLM] instance.
...@@ -116,7 +117,7 @@ vllm serve Qwen/Qwen2.5-1.5B-Instruct ...@@ -116,7 +117,7 @@ vllm serve Qwen/Qwen2.5-1.5B-Instruct
!!! note !!! note
By default, the server uses a predefined chat template stored in the tokenizer. By default, the server uses a predefined chat template stored in the tokenizer.
You can learn about overriding it [here][chat-template]. You can learn about overriding it [here][chat-template].
!!! warning !!! important
By default, the server applies `generation_config.json` from the huggingface model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator. By default, the server applies `generation_config.json` from the huggingface model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.
To disable this behavior, please pass `--generation-config vllm` when launching the server. To disable this behavior, please pass `--generation-config vllm` when launching the server.
......
...@@ -34,3 +34,40 @@ body[data-md-color-scheme="slate"] .md-nav__item--section > label.md-nav__link . ...@@ -34,3 +34,40 @@ body[data-md-color-scheme="slate"] .md-nav__item--section > label.md-nav__link .
color: rgba(255, 255, 255, 0.75) !important; color: rgba(255, 255, 255, 0.75) !important;
font-weight: 700; font-weight: 700;
} }
/* Custom admonitions */
:root {
--md-admonition-icon--announcement: url('data:image/svg+xml;charset=utf-8,<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 16 16" width="16" height="16"><path d="M3.25 9a.75.75 0 0 1 .75.75c0 2.142.456 3.828.733 4.653a.122.122 0 0 0 .05.064.212.212 0 0 0 .117.033h1.31c.085 0 .18-.042.258-.152a.45.45 0 0 0 .075-.366A16.743 16.743 0 0 1 6 9.75a.75.75 0 0 1 1.5 0c0 1.588.25 2.926.494 3.85.293 1.113-.504 2.4-1.783 2.4H4.9c-.686 0-1.35-.41-1.589-1.12A16.4 16.4 0 0 1 2.5 9.75.75.75 0 0 1 3.25 9Z"></path><path d="M0 6a4 4 0 0 1 4-4h2.75a.75.75 0 0 1 .75.75v6.5a.75.75 0 0 1-.75.75H4a4 4 0 0 1-4-4Zm4-2.5a2.5 2.5 0 1 0 0 5h2v-5Z"></path><path d="M15.59.082A.75.75 0 0 1 16 .75v10.5a.75.75 0 0 1-1.189.608l-.002-.001h.001l-.014-.01a5.775 5.775 0 0 0-.422-.25 10.63 10.63 0 0 0-1.469-.64C11.576 10.484 9.536 10 6.75 10a.75.75 0 0 1 0-1.5c2.964 0 5.174.516 6.658 1.043.423.151.787.302 1.092.443V2.014c-.305.14-.669.292-1.092.443C11.924 2.984 9.713 3.5 6.75 3.5a.75.75 0 0 1 0-1.5c2.786 0 4.826-.484 6.155-.957.665-.236 1.154-.47 1.47-.64.144-.077.284-.161.421-.25l.014-.01a.75.75 0 0 1 .78-.061Z"></path></svg>');
--md-admonition-icon--important: url('data:image/svg+xml;charset=utf-8,<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 16 16" width="16" height="16"><path d="M4.47.22A.749.749 0 0 1 5 0h6c.199 0 .389.079.53.22l4.25 4.25c.141.14.22.331.22.53v6a.749.749 0 0 1-.22.53l-4.25 4.25A.749.749 0 0 1 11 16H5a.749.749 0 0 1-.53-.22L.22 11.53A.749.749 0 0 1 0 11V5c0-.199.079-.389.22-.53Zm.84 1.28L1.5 5.31v5.38l3.81 3.81h5.38l3.81-3.81V5.31L10.69 1.5ZM8 4a.75.75 0 0 1 .75.75v3.5a.75.75 0 0 1-1.5 0v-3.5A.75.75 0 0 1 8 4Zm0 8a1 1 0 1 1 0-2 1 1 0 0 1 0 2Z"></path></svg>');
}
.md-typeset .admonition.announcement,
.md-typeset details.announcement {
border-color: rgb(255, 110, 66);
}
.md-typeset .admonition.important,
.md-typeset details.important {
border-color: rgb(239, 85, 82);
}
.md-typeset .announcement > .admonition-title,
.md-typeset .announcement > summary {
background-color: rgb(255, 110, 66, 0.1);
}
.md-typeset .important > .admonition-title,
.md-typeset .important > summary {
background-color: rgb(239, 85, 82, 0.1);
}
.md-typeset .announcement > .admonition-title::before,
.md-typeset .announcement > summary::before {
background-color: rgb(239, 85, 82);
-webkit-mask-image: var(--md-admonition-icon--announcement);
mask-image: var(--md-admonition-icon--announcement);
}
.md-typeset .important > .admonition-title::before,
.md-typeset .important > summary::before {
background-color: rgb(239, 85, 82);
-webkit-mask-image: var(--md-admonition-icon--important);
mask-image: var(--md-admonition-icon--important);
}
...@@ -51,7 +51,7 @@ for output in outputs: ...@@ -51,7 +51,7 @@ for output in outputs:
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
``` ```
!!! warning !!! important
By default, vLLM will use sampling parameters recommended by model creator by applying the `generation_config.json` from the huggingface model repository if it exists. In most cases, this will provide you with the best results by default if [SamplingParams][vllm.SamplingParams] is not specified. By default, vLLM will use sampling parameters recommended by model creator by applying the `generation_config.json` from the huggingface model repository if it exists. In most cases, this will provide you with the best results by default if [SamplingParams][vllm.SamplingParams] is not specified.
However, if vLLM's default sampling parameters are preferred, please pass `generation_config="vllm"` when creating the [LLM][vllm.LLM] instance. However, if vLLM's default sampling parameters are preferred, please pass `generation_config="vllm"` when creating the [LLM][vllm.LLM] instance.
...@@ -81,7 +81,7 @@ The [chat][vllm.LLM.chat] method implements chat functionality on top of [genera ...@@ -81,7 +81,7 @@ The [chat][vllm.LLM.chat] method implements chat functionality on top of [genera
In particular, it accepts input similar to [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat) In particular, it accepts input similar to [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat)
and automatically applies the model's [chat template](https://huggingface.co/docs/transformers/en/chat_templating) to format the prompt. and automatically applies the model's [chat template](https://huggingface.co/docs/transformers/en/chat_templating) to format the prompt.
!!! warning !!! important
In general, only instruction-tuned models have a chat template. In general, only instruction-tuned models have a chat template.
Base models may perform poorly as they are not trained to respond to the chat conversation. Base models may perform poorly as they are not trained to respond to the chat conversation.
......
...@@ -379,7 +379,7 @@ Specified using `--task generate`. ...@@ -379,7 +379,7 @@ Specified using `--task generate`.
See [this page](./pooling_models.md) for more information on how to use pooling models. See [this page](./pooling_models.md) for more information on how to use pooling models.
!!! warning !!! important
Since some model architectures support both generative and pooling tasks, Since some model architectures support both generative and pooling tasks,
you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode. you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.
...@@ -432,7 +432,7 @@ Specified using `--task reward`. ...@@ -432,7 +432,7 @@ Specified using `--task reward`.
If your model is not in the above list, we will try to automatically convert the model using If your model is not in the above list, we will try to automatically convert the model using
[as_reward_model][vllm.model_executor.models.adapters.as_reward_model]. By default, we return the hidden states of each token directly. [as_reward_model][vllm.model_executor.models.adapters.as_reward_model]. By default, we return the hidden states of each token directly.
!!! warning !!! important
For process-supervised reward models such as `peiyi9979/math-shepherd-mistral-7b-prm`, the pooling config should be set explicitly, For process-supervised reward models such as `peiyi9979/math-shepherd-mistral-7b-prm`, the pooling config should be set explicitly,
e.g.: `--override-pooler-config '{"pooling_type": "STEP", "step_tag_id": 123, "returned_token_ids": [456, 789]}'`. e.g.: `--override-pooler-config '{"pooling_type": "STEP", "step_tag_id": 123, "returned_token_ids": [456, 789]}'`.
...@@ -485,7 +485,7 @@ On the other hand, modalities separated by `/` are mutually exclusive. ...@@ -485,7 +485,7 @@ On the other hand, modalities separated by `/` are mutually exclusive.
See [this page][multimodal-inputs] on how to pass multi-modal inputs to the model. See [this page][multimodal-inputs] on how to pass multi-modal inputs to the model.
!!! warning !!! important
**To enable multiple multi-modal items per text prompt in vLLM V0**, you have to set `limit_mm_per_prompt` (offline inference) **To enable multiple multi-modal items per text prompt in vLLM V0**, you have to set `limit_mm_per_prompt` (offline inference)
or `--limit-mm-per-prompt` (online serving). For example, to enable passing up to 4 images per text prompt: or `--limit-mm-per-prompt` (online serving). For example, to enable passing up to 4 images per text prompt:
...@@ -640,7 +640,7 @@ Specified using `--task generate`. ...@@ -640,7 +640,7 @@ Specified using `--task generate`.
See [this page](./pooling_models.md) for more information on how to use pooling models. See [this page](./pooling_models.md) for more information on how to use pooling models.
!!! warning !!! important
Since some model architectures support both generative and pooling tasks, Since some model architectures support both generative and pooling tasks,
you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode. you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.
......
...@@ -36,7 +36,7 @@ print(completion.choices[0].message) ...@@ -36,7 +36,7 @@ print(completion.choices[0].message)
vLLM supports some parameters that are not supported by OpenAI, `top_k` for example. vLLM supports some parameters that are not supported by OpenAI, `top_k` for example.
You can pass these parameters to vLLM using the OpenAI client in the `extra_body` parameter of your requests, i.e. `extra_body={"top_k": 50}` for `top_k`. You can pass these parameters to vLLM using the OpenAI client in the `extra_body` parameter of your requests, i.e. `extra_body={"top_k": 50}` for `top_k`.
!!! warning !!! important
By default, the server applies `generation_config.json` from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator. By default, the server applies `generation_config.json` from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.
To disable this behavior, please pass `--generation-config vllm` when launching the server. To disable this behavior, please pass `--generation-config vllm` when launching the server.
...@@ -250,7 +250,7 @@ and passing a list of `messages` in the request. Refer to the examples below for ...@@ -250,7 +250,7 @@ and passing a list of `messages` in the request. Refer to the examples below for
--chat-template examples/template_vlm2vec.jinja --chat-template examples/template_vlm2vec.jinja
``` ```
!!! warning !!! important
Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--task embed` Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--task embed`
to run this model in embedding mode instead of text generation mode. to run this model in embedding mode instead of text generation mode.
...@@ -294,13 +294,13 @@ and passing a list of `messages` in the request. Refer to the examples below for ...@@ -294,13 +294,13 @@ and passing a list of `messages` in the request. Refer to the examples below for
--chat-template examples/template_dse_qwen2_vl.jinja --chat-template examples/template_dse_qwen2_vl.jinja
``` ```
!!! warning !!! important
Like with VLM2Vec, we have to explicitly pass `--task embed`. Like with VLM2Vec, we have to explicitly pass `--task embed`.
Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
by a custom chat template: <gh-file:examples/template_dse_qwen2_vl.jinja> by a custom chat template: <gh-file:examples/template_dse_qwen2_vl.jinja>
!!! warning !!! important
`MrLight/dse-qwen2-2b-mrl-v1` requires a placeholder image of the minimum image size for text query embeddings. See the full code `MrLight/dse-qwen2-2b-mrl-v1` requires a placeholder image of the minimum image size for text query embeddings. See the full code
example below for details. example below for details.
......
# vLLM V1 # vLLM V1
!!! important !!! announcement
We have started the process of deprecating V0. Please read [RFC #18571](https://github.com/vllm-project/vllm/issues/18571) for more details. We have started the process of deprecating V0. Please read [RFC #18571](https://github.com/vllm-project/vllm/issues/18571) for more details.
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment