Unverified commit b4bab816, authored by Harry Mellor and committed by GitHub

Remove unnecessary explicit title anchors and use relative links instead (#20620)


Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
parent b91cb3fa
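
The hunks below replace reference-style links such as `[FAQs][faq]` with relative links such as `[FAQs](../usage/faq.md)`. The following is a minimal sketch of how such a rewrite could be automated; it is not the tooling used for this commit. The `ANCHOR_TO_PATH` mapping and the `to_relative_links` helper are hypothetical, and in practice the correct relative path depends on the directory of the page that contains the link.

```python
import re

# Hypothetical mapping from anchor IDs to relative paths. The real target
# depends on where the linking page lives in the docs tree.
ANCHOR_TO_PATH = {
    "faq": "../usage/faq.md",
    "serving-openai-compatible-server": "../serving/openai_compatible_server.md",
}

# Matches reference-style links such as [FAQs][faq].
REF_LINK = re.compile(r"\[([^\]\[]+)\]\[([^\]\[]+)\]")


def to_relative_links(text: str, mapping: dict[str, str]) -> str:
    """Rewrite [label][anchor] as [label](path) when the anchor is mapped."""

    def replace(match: re.Match) -> str:
        label, anchor = match.group(1), match.group(2)
        path = mapping.get(anchor)
        # Leave unmapped references (for example API links) untouched.
        return f"[{label}]({path})" if path else match.group(0)

    return REF_LINK.sub(replace, text)
```

Unmapped references such as `[LLM][vllm.LLM]` pass through unchanged, which mirrors how the diff leaves API cross-references in place.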
--- ---
title: INT8 W8A8 title: INT8 W8A8
--- ---
[](){ #int8 }
vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration. vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration.
This quantization method is particularly useful for reducing model size while maintaining good performance. This quantization method is particularly useful for reducing model size while maintaining good performance.
......
--- ---
title: Quantized KV Cache title: Quantized KV Cache
--- ---
[](){ #quantized-kvcache }
## FP8 KV Cache ## FP8 KV Cache
......
--- ---
title: AMD Quark title: AMD Quark
--- ---
[](){ #quark }
Quantization can effectively reduce memory and bandwidth usage, accelerate computation and improve Quantization can effectively reduce memory and bandwidth usage, accelerate computation and improve
throughput while with minimal accuracy loss. vLLM can leverage [Quark](https://quark.docs.amd.com/latest/), throughput while with minimal accuracy loss. vLLM can leverage [Quark](https://quark.docs.amd.com/latest/),
......
--- ---
title: Supported Hardware title: Supported Hardware
--- ---
[](){ #quantization-supported-hardware }
The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM: The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
......
--- ---
title: Reasoning Outputs title: Reasoning Outputs
--- ---
[](){ #reasoning-outputs }
vLLM offers support for reasoning models like [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1), which are designed to generate outputs containing both reasoning steps and final conclusions. vLLM offers support for reasoning models like [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1), which are designed to generate outputs containing both reasoning steps and final conclusions.
......
--- ---
title: Speculative Decoding title: Speculative Decoding
--- ---
[](){ #spec-decode }
!!! warning !!! warning
Please note that speculative decoding in vLLM is not yet optimized and does Please note that speculative decoding in vLLM is not yet optimized and does
...@@ -269,7 +268,7 @@ speculative decoding, breaking down the guarantees into three key areas: ...@@ -269,7 +268,7 @@ speculative decoding, breaking down the guarantees into three key areas:
3. **vLLM Logprob Stability** 3. **vLLM Logprob Stability**
\- vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the \- vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the
same request across runs. For more details, see the FAQ section same request across runs. For more details, see the FAQ section
titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs][faq]. titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../usage/faq.md).
While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding
can occur due to following factors: can occur due to following factors:
...@@ -278,7 +277,7 @@ can occur due to following factors: ...@@ -278,7 +277,7 @@ can occur due to following factors:
- **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially - **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially
due to non-deterministic behavior in batched operations or numerical instability. due to non-deterministic behavior in batched operations or numerical instability.
For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs][faq]. For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../usage/faq.md).
## Resources for vLLM contributors ## Resources for vLLM contributors
......
--- ---
title: Structured Outputs title: Structured Outputs
--- ---
[](){ #structured-outputs }
vLLM supports the generation of structured outputs using vLLM supports the generation of structured outputs using
[xgrammar](https://github.com/mlc-ai/xgrammar) or [xgrammar](https://github.com/mlc-ai/xgrammar) or
...@@ -21,7 +20,7 @@ The following parameters are supported, which must be added as extra parameters: ...@@ -21,7 +20,7 @@ The following parameters are supported, which must be added as extra parameters:
- `guided_grammar`: the output will follow the context free grammar. - `guided_grammar`: the output will follow the context free grammar.
- `structural_tag`: Follow a JSON schema within a set of specified tags within the generated text. - `structural_tag`: Follow a JSON schema within a set of specified tags within the generated text.
You can see the complete list of supported parameters on the [OpenAI-Compatible Server][serving-openai-compatible-server] page. You can see the complete list of supported parameters on the [OpenAI-Compatible Server](../serving/openai_compatible_server.md) page.
Structured outputs are supported by default in the OpenAI-Compatible Server. You Structured outputs are supported by default in the OpenAI-Compatible Server. You
may choose to specify the backend to use by setting the may choose to specify the backend to use by setting the
......
--- ---
title: Installation title: Installation
--- ---
[](){ #installation-index }
vLLM supports the following hardware platforms: vLLM supports the following hardware platforms:
......
...@@ -109,8 +109,8 @@ docker run \ ...@@ -109,8 +109,8 @@ docker run \
### Supported features ### Supported features
- [Offline inference][offline-inference] - [Offline inference](../../serving/offline_inference.md)
- Online serving via [OpenAI-Compatible Server][serving-openai-compatible-server] - Online serving via [OpenAI-Compatible Server](../../serving/openai_compatible_server.md)
- HPU autodetection - no need to manually select device within vLLM - HPU autodetection - no need to manually select device within vLLM
- Paged KV cache with algorithms enabled for Intel Gaudi accelerators - Paged KV cache with algorithms enabled for Intel Gaudi accelerators
- Custom Intel Gaudi implementations of Paged Attention, KV cache ops, - Custom Intel Gaudi implementations of Paged Attention, KV cache ops,
......
--- ---
title: Quickstart title: Quickstart
--- ---
[](){ #quickstart }
This guide will help you quickly get started with vLLM to perform: This guide will help you quickly get started with vLLM to perform:
...@@ -43,7 +42,7 @@ uv pip install vllm --torch-backend=auto ...@@ -43,7 +42,7 @@ uv pip install vllm --torch-backend=auto
``` ```
!!! note !!! note
For more detail and non-CUDA platforms, please refer [here][installation-index] for specific instructions on how to install vLLM. For more detail and non-CUDA platforms, please refer [here](installation/README.md) for specific instructions on how to install vLLM.
[](){ #quickstart-offline } [](){ #quickstart-offline }
...@@ -77,7 +76,7 @@ prompts = [ ...@@ -77,7 +76,7 @@ prompts = [
sampling_params = SamplingParams(temperature=0.8, top_p=0.95) sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
``` ```
The [LLM][vllm.LLM] class initializes vLLM's engine and the [OPT-125M model](https://arxiv.org/abs/2205.01068) for offline inference. The list of supported models can be found [here][supported-models]. The [LLM][vllm.LLM] class initializes vLLM's engine and the [OPT-125M model](https://arxiv.org/abs/2205.01068) for offline inference. The list of supported models can be found [here](../models/supported_models.md).
```python ```python
llm = LLM(model="facebook/opt-125m") llm = LLM(model="facebook/opt-125m")
......
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import itertools import itertools
import logging
from dataclasses import dataclass, field from dataclasses import dataclass, field
from pathlib import Path from pathlib import Path
from typing import Literal from typing import Literal
import regex as re import regex as re
logger = logging.getLogger("mkdocs")
ROOT_DIR = Path(__file__).parent.parent.parent.parent ROOT_DIR = Path(__file__).parent.parent.parent.parent
ROOT_DIR_RELATIVE = '../../../../..' ROOT_DIR_RELATIVE = '../../../../..'
EXAMPLE_DIR = ROOT_DIR / "examples" EXAMPLE_DIR = ROOT_DIR / "examples"
EXAMPLE_DOC_DIR = ROOT_DIR / "docs/examples" EXAMPLE_DOC_DIR = ROOT_DIR / "docs/examples"
print(ROOT_DIR.resolve())
print(EXAMPLE_DIR.resolve())
print(EXAMPLE_DOC_DIR.resolve())
def fix_case(text: str) -> str: def fix_case(text: str) -> str:
...@@ -135,6 +135,11 @@ class Example: ...@@ -135,6 +135,11 @@ class Example:
def on_startup(command: Literal["build", "gh-deploy", "serve"], dirty: bool): def on_startup(command: Literal["build", "gh-deploy", "serve"], dirty: bool):
logger.info("Generating example documentation")
logger.debug("Root directory: %s", ROOT_DIR.resolve())
logger.debug("Example directory: %s", EXAMPLE_DIR.resolve())
logger.debug("Example document directory: %s", EXAMPLE_DOC_DIR.resolve())
# Create the EXAMPLE_DOC_DIR if it doesn't exist # Create the EXAMPLE_DOC_DIR if it doesn't exist
if not EXAMPLE_DOC_DIR.exists(): if not EXAMPLE_DOC_DIR.exists():
EXAMPLE_DOC_DIR.mkdir(parents=True) EXAMPLE_DOC_DIR.mkdir(parents=True)
...@@ -156,7 +161,7 @@ def on_startup(command: Literal["build", "gh-deploy", "serve"], dirty: bool): ...@@ -156,7 +161,7 @@ def on_startup(command: Literal["build", "gh-deploy", "serve"], dirty: bool):
for example in sorted(examples, key=lambda e: e.path.stem): for example in sorted(examples, key=lambda e: e.path.stem):
example_name = f"{example.path.stem}.md" example_name = f"{example.path.stem}.md"
doc_path = EXAMPLE_DOC_DIR / example.category / example_name doc_path = EXAMPLE_DOC_DIR / example.category / example_name
print(doc_path) logger.debug("Example generated: %s", doc_path.relative_to(ROOT_DIR))
if not doc_path.parent.exists(): if not doc_path.parent.exists():
doc_path.parent.mkdir(parents=True) doc_path.parent.mkdir(parents=True)
with open(doc_path, "w+") as f: with open(doc_path, "w+") as f:
......
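
The hook change above swaps module-level `print` calls for a logger named `"mkdocs"`, with one `info` message and path details at `debug` level. The pattern can be sketched in isolation as follows; `ROOT_DIR` here is a placeholder path and `report_generated` is a hypothetical helper, not a function from the hook.

```python
import logging
from pathlib import Path

# Same logger name as in the diff, so messages follow MkDocs' log settings.
logger = logging.getLogger("mkdocs")

# Placeholder root for illustration; the hook derives it from __file__.
ROOT_DIR = Path("/tmp/project")


def report_generated(doc_paths: list[Path]) -> None:
    """Log one summary line, then each generated path relative to the root."""
    logger.info("Generating example documentation")
    for doc_path in doc_paths:
        # Lazy %-formatting: the string is only built if DEBUG is enabled.
        logger.debug("Example generated: %s", doc_path.relative_to(ROOT_DIR))
```

Logging inside the function rather than at import time means nothing is emitted just by loading the module, and per-file paths only appear when debug output is enabled.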
--- ---
title: Loading models with Run:ai Model Streamer title: Loading models with Run:ai Model Streamer
--- ---
[](){ #runai-model-streamer }
Run:ai Model Streamer is a library to read tensors in concurrency, while streaming it to GPU memory. Run:ai Model Streamer is a library to read tensors in concurrency, while streaming it to GPU memory.
Further reading can be found in [Run:ai Model Streamer Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/README.md). Further reading can be found in [Run:ai Model Streamer Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/README.md).
......
--- ---
title: Loading models with CoreWeave's Tensorizer title: Loading models with CoreWeave's Tensorizer
--- ---
[](){ #tensorizer }
vLLM supports loading models with [CoreWeave's Tensorizer](https://docs.coreweave.com/coreweave-machine-learning-and-ai/inference/tensorizer). vLLM supports loading models with [CoreWeave's Tensorizer](https://docs.coreweave.com/coreweave-machine-learning-and-ai/inference/tensorizer).
vLLM model tensors that have been serialized to disk, an HTTP/HTTPS endpoint, or S3 endpoint can be deserialized vLLM model tensors that have been serialized to disk, an HTTP/HTTPS endpoint, or S3 endpoint can be deserialized
......
--- ---
title: Generative Models title: Generative Models
--- ---
[](){ #generative-models }
vLLM provides first-class support for generative models, which covers most of LLMs. vLLM provides first-class support for generative models, which covers most of LLMs.
...@@ -134,7 +133,7 @@ outputs = llm.chat(conversation, chat_template=custom_template) ...@@ -134,7 +133,7 @@ outputs = llm.chat(conversation, chat_template=custom_template)
## Online Serving ## Online Serving
Our [OpenAI-Compatible Server][serving-openai-compatible-server] provides endpoints that correspond to the offline APIs: Our [OpenAI-Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:
- [Completions API][completions-api] is similar to `LLM.generate` but only accepts text. - [Completions API][completions-api] is similar to `LLM.generate` but only accepts text.
- [Chat API][chat-api] is similar to `LLM.chat`, accepting both text and [multi-modal inputs][multimodal-inputs] for models with a chat template. - [Chat API][chat-api] is similar to `LLM.chat`, accepting both text and [multi-modal inputs](../features/multimodal_inputs.md) for models with a chat template.
--- ---
title: TPU title: TPU
--- ---
[](){ #tpu-supported-models }
# TPU Supported Models # TPU Supported Models
## Text-only Language Models ## Text-only Language Models
......
--- ---
title: Pooling Models title: Pooling Models
--- ---
[](){ #pooling-models }
vLLM also supports pooling models, including embedding, reranking and reward models. vLLM also supports pooling models, including embedding, reranking and reward models.
...@@ -11,7 +10,7 @@ before returning them. ...@@ -11,7 +10,7 @@ before returning them.
!!! note !!! note
We currently support pooling models primarily as a matter of convenience. We currently support pooling models primarily as a matter of convenience.
As shown in the [Compatibility Matrix][compatibility-matrix], most vLLM features are not applicable to As shown in the [Compatibility Matrix](../features/compatibility_matrix.md), most vLLM features are not applicable to
pooling models as they only work on the generation or decode stage, so performance may not improve as much. pooling models as they only work on the generation or decode stage, so performance may not improve as much.
For pooling models, we support the following `--task` options. For pooling models, we support the following `--task` options.
...@@ -113,10 +112,10 @@ A code example can be found here: <gh-file:examples/offline_inference/basic/scor ...@@ -113,10 +112,10 @@ A code example can be found here: <gh-file:examples/offline_inference/basic/scor
## Online Serving ## Online Serving
Our [OpenAI-Compatible Server][serving-openai-compatible-server] provides endpoints that correspond to the offline APIs: Our [OpenAI-Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:
- [Pooling API][pooling-api] is similar to `LLM.encode`, being applicable to all types of pooling models. - [Pooling API][pooling-api] is similar to `LLM.encode`, being applicable to all types of pooling models.
- [Embeddings API][embeddings-api] is similar to `LLM.embed`, accepting both text and [multi-modal inputs][multimodal-inputs] for embedding models. - [Embeddings API][embeddings-api] is similar to `LLM.embed`, accepting both text and [multi-modal inputs](../features/multimodal_inputs.md) for embedding models.
- [Classification API][classification-api] is similar to `LLM.classify` and is applicable to sequence classification models. - [Classification API][classification-api] is similar to `LLM.classify` and is applicable to sequence classification models.
- [Score API][score-api] is similar to `LLM.score` for cross-encoder models. - [Score API][score-api] is similar to `LLM.score` for cross-encoder models.
......
--- ---
title: Supported Models title: Supported Models
--- ---
[](){ #supported-models }
vLLM supports [generative](./generative_models.md) and [pooling](./pooling_models.md) models across various tasks. vLLM supports [generative](./generative_models.md) and [pooling](./pooling_models.md) models across various tasks.
If a model supports more than one task, you can set the task via the `--task` argument. If a model supports more than one task, you can set the task via the `--task` argument.
...@@ -34,7 +33,7 @@ llm.apply_model(lambda model: print(type(model))) ...@@ -34,7 +33,7 @@ llm.apply_model(lambda model: print(type(model)))
If it is `TransformersForCausalLM` then it means it's based on Transformers! If it is `TransformersForCausalLM` then it means it's based on Transformers!
!!! tip !!! tip
You can force the use of `TransformersForCausalLM` by setting `model_impl="transformers"` for [offline-inference][offline-inference] or `--model-impl transformers` for the [openai-compatible-server][serving-openai-compatible-server]. You can force the use of `TransformersForCausalLM` by setting `model_impl="transformers"` for [offline-inference](../serving/offline_inference.md) or `--model-impl transformers` for the [openai-compatible-server](../serving/openai_compatible_server.md).
!!! note !!! note
vLLM may not fully optimise the Transformers implementation so you may see degraded performance if comparing a native model to a Transformers model in vLLM. vLLM may not fully optimise the Transformers implementation so you may see degraded performance if comparing a native model to a Transformers model in vLLM.
...@@ -53,8 +52,8 @@ For a model to be compatible with the Transformers backend for vLLM it must: ...@@ -53,8 +52,8 @@ For a model to be compatible with the Transformers backend for vLLM it must:
If the compatible model is: If the compatible model is:
- on the Hugging Face Model Hub, simply set `trust_remote_code=True` for [offline-inference][offline-inference] or `--trust-remote-code` for the [openai-compatible-server][serving-openai-compatible-server]. - on the Hugging Face Model Hub, simply set `trust_remote_code=True` for [offline-inference](../serving/offline_inference.md) or `--trust-remote-code` for the [openai-compatible-server](../serving/openai_compatible_server.md).
- in a local directory, simply pass directory path to `model=<MODEL_DIR>` for [offline-inference][offline-inference] or `vllm serve <MODEL_DIR>` for the [openai-compatible-server][serving-openai-compatible-server]. - in a local directory, simply pass directory path to `model=<MODEL_DIR>` for [offline-inference](../serving/offline_inference.md) or `vllm serve <MODEL_DIR>` for the [openai-compatible-server](../serving/openai_compatible_server.md).
This means that, with the Transformers backend for vLLM, new models can be used before they are officially supported in Transformers or vLLM! This means that, with the Transformers backend for vLLM, new models can be used before they are officially supported in Transformers or vLLM!
...@@ -171,7 +170,7 @@ The [Transformers backend][transformers-backend] enables you to run models direc ...@@ -171,7 +170,7 @@ The [Transformers backend][transformers-backend] enables you to run models direc
If vLLM successfully returns text (for generative models) or hidden states (for pooling models), it indicates that your model is supported. If vLLM successfully returns text (for generative models) or hidden states (for pooling models), it indicates that your model is supported.
Otherwise, please refer to [Adding a New Model][new-model] for instructions on how to implement your model in vLLM. Otherwise, please refer to [Adding a New Model](../contributing/model/README.md) for instructions on how to implement your model in vLLM.
Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) to request vLLM support. Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) to request vLLM support.
#### Download a model #### Download a model
...@@ -308,13 +307,13 @@ print(output) ...@@ -308,13 +307,13 @@ print(output)
### Generative Models ### Generative Models
See [this page][generative-models] for more information on how to use generative models. See [this page](generative_models.md) for more information on how to use generative models.
#### Text Generation #### Text Generation
Specified using `--task generate`. Specified using `--task generate`.
| Architecture | Models | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) | | Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|-------------------|----------------------|---------------------------|---------------------| |--------------|--------|-------------------|----------------------|---------------------------|---------------------|
| `AquilaForCausalLM` | Aquila, Aquila2 | `BAAI/Aquila-7B`, `BAAI/AquilaChat-7B`, etc. | ✅︎ | ✅︎ | ✅︎ | | `AquilaForCausalLM` | Aquila, Aquila2 | `BAAI/Aquila-7B`, `BAAI/AquilaChat-7B`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `ArcticForCausalLM` | Arctic | `Snowflake/snowflake-arctic-base`, `Snowflake/snowflake-arctic-instruct`, etc. | | ✅︎ | ✅︎ | | `ArcticForCausalLM` | Arctic | `Snowflake/snowflake-arctic-base`, `Snowflake/snowflake-arctic-instruct`, etc. | | ✅︎ | ✅︎ |
...@@ -412,7 +411,7 @@ See [this page](./pooling_models.md) for more information on how to use pooling ...@@ -412,7 +411,7 @@ See [this page](./pooling_models.md) for more information on how to use pooling
Specified using `--task embed`. Specified using `--task embed`.
| Architecture | Models | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) | | Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|-------------------|----------------------|---------------------------|---------------------| |--------------|--------|-------------------|----------------------|---------------------------|---------------------|
| `BertModel` | BERT-based | `BAAI/bge-base-en-v1.5`, `Snowflake/snowflake-arctic-embed-xs`, etc. | | | | | `BertModel` | BERT-based | `BAAI/bge-base-en-v1.5`, `Snowflake/snowflake-arctic-embed-xs`, etc. | | | |
| `Gemma2Model` | Gemma 2-based | `BAAI/bge-multilingual-gemma2`, etc. | ✅︎ | | ✅︎ | | `Gemma2Model` | Gemma 2-based | `BAAI/bge-multilingual-gemma2`, etc. | ✅︎ | | ✅︎ |
...@@ -448,7 +447,7 @@ of the whole prompt are extracted from the normalized hidden state corresponding ...@@ -448,7 +447,7 @@ of the whole prompt are extracted from the normalized hidden state corresponding
Specified using `--task reward`. Specified using `--task reward`.
| Architecture | Models | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) | | Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|-------------------|----------------------|---------------------------|---------------------| |--------------|--------|-------------------|----------------------|---------------------------|---------------------|
| `InternLM2ForRewardModel` | InternLM2-based | `internlm/internlm2-1_8b-reward`, `internlm/internlm2-7b-reward`, etc. | ✅︎ | ✅︎ | ✅︎ | | `InternLM2ForRewardModel` | InternLM2-based | `internlm/internlm2-1_8b-reward`, `internlm/internlm2-7b-reward`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `LlamaForCausalLM` | Llama-based | `peiyi9979/math-shepherd-mistral-7b-prm`, etc. | ✅︎ | ✅︎ | ✅︎ | | `LlamaForCausalLM` | Llama-based | `peiyi9979/math-shepherd-mistral-7b-prm`, etc. | ✅︎ | ✅︎ | ✅︎ |
...@@ -466,7 +465,7 @@ If your model is not in the above list, we will try to automatically convert the ...@@ -466,7 +465,7 @@ If your model is not in the above list, we will try to automatically convert the
Specified using `--task classify`. Specified using `--task classify`.
| Architecture | Models | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) | | Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|-------------------|----------------------|---------------------------|---------------------| |--------------|--------|-------------------|----------------------|---------------------------|---------------------|
| `JambaForSequenceClassification` | Jamba | `ai21labs/Jamba-tiny-reward-dev`, etc. | ✅︎ | ✅︎ | | | `JambaForSequenceClassification` | Jamba | `ai21labs/Jamba-tiny-reward-dev`, etc. | ✅︎ | ✅︎ | |
| `GPT2ForSequenceClassification` | GPT2 | `nie3e/sentiment-polish-gpt2-small` | | | ✅︎ | | `GPT2ForSequenceClassification` | GPT2 | `nie3e/sentiment-polish-gpt2-small` | | | ✅︎ |
...@@ -527,7 +526,7 @@ On the other hand, modalities separated by `/` are mutually exclusive. ...@@ -527,7 +526,7 @@ On the other hand, modalities separated by `/` are mutually exclusive.
- e.g.: `T / I` means that the model supports text-only and image-only inputs, but not text-with-image inputs. - e.g.: `T / I` means that the model supports text-only and image-only inputs, but not text-with-image inputs.
See [this page][multimodal-inputs] on how to pass multi-modal inputs to the model. See [this page](../features/multimodal_inputs.md) on how to pass multi-modal inputs to the model.
!!! important !!! important
**To enable multiple multi-modal items per text prompt in vLLM V0**, you have to set `limit_mm_per_prompt` (offline inference) **To enable multiple multi-modal items per text prompt in vLLM V0**, you have to set `limit_mm_per_prompt` (offline inference)
...@@ -557,13 +556,13 @@ See [this page][multimodal-inputs] on how to pass multi-modal inputs to the mode ...@@ -557,13 +556,13 @@ See [this page][multimodal-inputs] on how to pass multi-modal inputs to the mode
### Generative Models ### Generative Models
See [this page][generative-models] for more information on how to use generative models. See [this page](generative_models.md) for more information on how to use generative models.
#### Text Generation #### Text Generation
Specified using `--task generate`. Specified using `--task generate`.
| Architecture | Models | Inputs | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) | | Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|--------|-------------------|----------------------|---------------------------|---------------------| |--------------|--------|--------|-------------------|----------------------|---------------------------|---------------------|
| `AriaForConditionalGeneration` | Aria | T + I<sup>+</sup> | `rhymes-ai/Aria` | | | ✅︎ | | `AriaForConditionalGeneration` | Aria | T + I<sup>+</sup> | `rhymes-ai/Aria` | | | ✅︎ |
| `AyaVisionForConditionalGeneration` | Aya Vision | T + I<sup>+</sup> | `CohereForAI/aya-vision-8b`, `CohereForAI/aya-vision-32b`, etc. | | ✅︎ | ✅︎ | | `AyaVisionForConditionalGeneration` | Aya Vision | T + I<sup>+</sup> | `CohereForAI/aya-vision-8b`, `CohereForAI/aya-vision-32b`, etc. | | ✅︎ | ✅︎ |
...@@ -685,7 +684,7 @@ Specified using `--task transcription`. ...@@ -685,7 +684,7 @@ Specified using `--task transcription`.
Speech2Text models trained specifically for Automatic Speech Recognition. Speech2Text models trained specifically for Automatic Speech Recognition.
| Architecture | Models | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) | | Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|-------------------|----------------------|---------------------------|---------------------| |--------------|--------|-------------------|----------------------|---------------------------|---------------------|
| `WhisperForConditionalGeneration` | Whisper | `openai/whisper-small`, `openai/whisper-large-v3-turbo`, etc. | | | | | `WhisperForConditionalGeneration` | Whisper | `openai/whisper-small`, `openai/whisper-large-v3-turbo`, etc. | | | |
...@@ -708,7 +707,7 @@ Any text generation model can be converted into an embedding model by passing `- ...@@ -708,7 +707,7 @@ Any text generation model can be converted into an embedding model by passing `-
The following table lists those that are tested in vLLM. The following table lists those that are tested in vLLM.
| Architecture | Models | Inputs | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) | | Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|--------|-------------------|----------------------|---------------------------|---------------------| |--------------|--------|--------|-------------------|----------------------|---------------------------|---------------------|
| `LlavaNextForConditionalGeneration` | LLaVA-NeXT-based | T / I | `royokong/e5-v` | | | | | `LlavaNextForConditionalGeneration` | LLaVA-NeXT-based | T / I | `royokong/e5-v` | | | |
| `Phi3VForCausalLM` | Phi-3-Vision-based | T + I | `TIGER-Lab/VLM2Vec-Full` | 🚧 | ✅︎ | | | `Phi3VForCausalLM` | Phi-3-Vision-based | T + I | `TIGER-Lab/VLM2Vec-Full` | 🚧 | ✅︎ | |
......
--- ---
title: Distributed Inference and Serving title: Distributed Inference and Serving
--- ---
[](){ #distributed-serving }
## How to decide the distributed inference strategy? ## How to decide the distributed inference strategy?
......
--- ---
title: LangChain title: LangChain
--- ---
[](){ #serving-langchain }
vLLM is also available via [LangChain](https://github.com/langchain-ai/langchain) . vLLM is also available via [LangChain](https://github.com/langchain-ai/langchain) .
......
--- ---
title: LlamaIndex title: LlamaIndex
--- ---
[](){ #serving-llamaindex }
vLLM is also available via [LlamaIndex](https://github.com/run-llama/llama_index) . vLLM is also available via [LlamaIndex](https://github.com/run-llama/llama_index) .
......