Commit b4bab816 authored by Harry Mellor, committed by GitHub

Remove unnecessary explicit title anchors and use relative links instead (#20620)


Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
parent b91cb3fa
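
The change in this commit, replacing reference-style anchor links such as `[FAQs][faq]` with relative file links such as `[FAQs](../usage/faq.md)`, can be sketched roughly as below. This is an illustrative sketch only, not the script used for the commit: the `ANCHOR_TO_PATH` mapping, the `to_relative_links` helper, and the assumed location of the current file are all hypothetical.

```python
import posixpath
import re

# Hypothetical mapping from old anchor names to docs-relative file paths.
# Only a few entries are shown; they mirror links visible in this diff.
ANCHOR_TO_PATH = {
    "faq": "usage/faq.md",
    "lora-adapter": "features/lora.md",
    "serving-openai-compatible-server": "serving/openai_compatible_server.md",
}

# Matches reference-style links such as [FAQs][faq].
REF_LINK = re.compile(r"\[([^\]]+)\]\[([^\]]+)\]")


def to_relative_links(markdown: str, current_file: str) -> str:
    """Rewrite [text][anchor] into [text](relative/path.md) for known anchors."""
    current_dir = posixpath.dirname(current_file)

    def replace(match: re.Match) -> str:
        text, anchor = match.group(1), match.group(2)
        target = ANCHOR_TO_PATH.get(anchor)
        if target is None:
            # Unknown anchors (for example API references) are left untouched.
            return match.group(0)
        return f"[{text}]({posixpath.relpath(target, current_dir)})"

    return REF_LINK.sub(replace, markdown)


# Assumes the page lives at features/spec_decode.md relative to the docs root.
print(to_relative_links("in the [FAQs][faq].", "features/spec_decode.md"))
```

Computing the target with `posixpath.relpath` keeps the link correct for pages at any depth, which is why the same anchor becomes `../serving/...` in one file and `../../serving/...` in another.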
---
title: INT8 W8A8
---
[](){ #int8 }
vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration.
This quantization method is particularly useful for reducing model size while maintaining good performance.
......
---
title: Quantized KV Cache
---
[](){ #quantized-kvcache }
## FP8 KV Cache
......
---
title: AMD Quark
---
[](){ #quark }
Quantization can effectively reduce memory and bandwidth usage, accelerate computation and improve
throughput with minimal accuracy loss. vLLM can leverage [Quark](https://quark.docs.amd.com/latest/),
......
---
title: Supported Hardware
---
[](){ #quantization-supported-hardware }
The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
......
---
title: Reasoning Outputs
---
[](){ #reasoning-outputs }
vLLM offers support for reasoning models like [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1), which are designed to generate outputs containing both reasoning steps and final conclusions.
......
---
title: Speculative Decoding
---
[](){ #spec-decode }
!!! warning
Please note that speculative decoding in vLLM is not yet optimized and does
......@@ -269,7 +268,7 @@ speculative decoding, breaking down the guarantees into three key areas:
3. **vLLM Logprob Stability**
\- vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the
same request across runs. For more details, see the FAQ section
titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs][faq].
titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../usage/faq.md).
While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding
can occur due to the following factors:
......@@ -278,7 +277,7 @@ can occur due to following factors:
- **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially
due to non-deterministic behavior in batched operations or numerical instability.
For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs][faq].
For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](../usage/faq.md).
## Resources for vLLM contributors
......
---
title: Structured Outputs
---
[](){ #structured-outputs }
vLLM supports the generation of structured outputs using
[xgrammar](https://github.com/mlc-ai/xgrammar) or
......@@ -21,7 +20,7 @@ The following parameters are supported, which must be added as extra parameters:
- `guided_grammar`: the output will follow the context free grammar.
- `structural_tag`: Follow a JSON schema within a set of specified tags within the generated text.
You can see the complete list of supported parameters on the [OpenAI-Compatible Server][serving-openai-compatible-server] page.
You can see the complete list of supported parameters on the [OpenAI-Compatible Server](../serving/openai_compatible_server.md) page.
Structured outputs are supported by default in the OpenAI-Compatible Server. You
may choose to specify the backend to use by setting the
......
---
title: Installation
---
[](){ #installation-index }
vLLM supports the following hardware platforms:
......
......@@ -109,8 +109,8 @@ docker run \
### Supported features
- [Offline inference][offline-inference]
- Online serving via [OpenAI-Compatible Server][serving-openai-compatible-server]
- [Offline inference](../../serving/offline_inference.md)
- Online serving via [OpenAI-Compatible Server](../../serving/openai_compatible_server.md)
- HPU autodetection - no need to manually select device within vLLM
- Paged KV cache with algorithms enabled for Intel Gaudi accelerators
- Custom Intel Gaudi implementations of Paged Attention, KV cache ops,
......
---
title: Quickstart
---
[](){ #quickstart }
This guide will help you quickly get started with vLLM to perform:
......@@ -43,7 +42,7 @@ uv pip install vllm --torch-backend=auto
```
!!! note
For more detail and non-CUDA platforms, please refer [here][installation-index] for specific instructions on how to install vLLM.
For more detail and non-CUDA platforms, please refer [here](installation/README.md) for specific instructions on how to install vLLM.
[](){ #quickstart-offline }
......@@ -77,7 +76,7 @@ prompts = [
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
```
The [LLM][vllm.LLM] class initializes vLLM's engine and the [OPT-125M model](https://arxiv.org/abs/2205.01068) for offline inference. The list of supported models can be found [here][supported-models].
The [LLM][vllm.LLM] class initializes vLLM's engine and the [OPT-125M model](https://arxiv.org/abs/2205.01068) for offline inference. The list of supported models can be found [here](../models/supported_models.md).
```python
llm = LLM(model="facebook/opt-125m")
......
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import itertools
import logging
from dataclasses import dataclass, field
from pathlib import Path
from typing import Literal
import regex as re
logger = logging.getLogger("mkdocs")
ROOT_DIR = Path(__file__).parent.parent.parent.parent
ROOT_DIR_RELATIVE = '../../../../..'
EXAMPLE_DIR = ROOT_DIR / "examples"
EXAMPLE_DOC_DIR = ROOT_DIR / "docs/examples"
print(ROOT_DIR.resolve())
print(EXAMPLE_DIR.resolve())
print(EXAMPLE_DOC_DIR.resolve())
def fix_case(text: str) -> str:
......@@ -135,6 +135,11 @@ class Example:
def on_startup(command: Literal["build", "gh-deploy", "serve"], dirty: bool):
    logger.info("Generating example documentation")
    logger.debug("Root directory: %s", ROOT_DIR.resolve())
    logger.debug("Example directory: %s", EXAMPLE_DIR.resolve())
    logger.debug("Example document directory: %s", EXAMPLE_DOC_DIR.resolve())
    # Create the EXAMPLE_DOC_DIR if it doesn't exist
    if not EXAMPLE_DOC_DIR.exists():
        EXAMPLE_DOC_DIR.mkdir(parents=True)
......@@ -156,7 +161,7 @@ def on_startup(command: Literal["build", "gh-deploy", "serve"], dirty: bool):
    for example in sorted(examples, key=lambda e: e.path.stem):
        example_name = f"{example.path.stem}.md"
        doc_path = EXAMPLE_DOC_DIR / example.category / example_name
        print(doc_path)
        logger.debug("Example generated: %s", doc_path.relative_to(ROOT_DIR))
        if not doc_path.parent.exists():
            doc_path.parent.mkdir(parents=True)
        with open(doc_path, "w+") as f:
......
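
The hook change above swaps `print` calls for `logger.debug` on the `mkdocs` logger. A minimal sketch of the effect follows: debug messages stay silent unless the logger level allows them, and `relative_to` shortens the logged path. The `ListHandler` class and the `/repo` paths are hypothetical and exist only for this illustration.

```python
import logging
from pathlib import PurePosixPath

logger = logging.getLogger("mkdocs")

# Hypothetical repository layout, for illustration only.
ROOT_DIR = PurePosixPath("/repo")
doc_path = ROOT_DIR / "docs/examples/basic/chat.md"


class ListHandler(logging.Handler):
    """Collects formatted messages so the behaviour can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


handler = ListHandler()
logger.addHandler(handler)

# At INFO level the debug call is dropped before the message is formatted.
logger.setLevel(logging.INFO)
logger.debug("Example generated: %s", doc_path.relative_to(ROOT_DIR))

# At DEBUG level the same call is recorded.
logger.setLevel(logging.DEBUG)
logger.debug("Example generated: %s", doc_path.relative_to(ROOT_DIR))

print(handler.messages)
```

Passing the path as a `%s` argument rather than pre-formatting the string means the message is only built when the record is actually emitted.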
---
title: Loading models with Run:ai Model Streamer
---
[](){ #runai-model-streamer }
Run:ai Model Streamer is a library that reads tensors concurrently while streaming them to GPU memory.
Further reading can be found in [Run:ai Model Streamer Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/README.md).
......
---
title: Loading models with CoreWeave's Tensorizer
---
[](){ #tensorizer }
vLLM supports loading models with [CoreWeave's Tensorizer](https://docs.coreweave.com/coreweave-machine-learning-and-ai/inference/tensorizer).
vLLM model tensors that have been serialized to disk, an HTTP/HTTPS endpoint, or S3 endpoint can be deserialized
......
---
title: Generative Models
---
[](){ #generative-models }
vLLM provides first-class support for generative models, which cover most LLMs.
......@@ -134,7 +133,7 @@ outputs = llm.chat(conversation, chat_template=custom_template)
## Online Serving
Our [OpenAI-Compatible Server][serving-openai-compatible-server] provides endpoints that correspond to the offline APIs:
Our [OpenAI-Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:
- [Completions API][completions-api] is similar to `LLM.generate` but only accepts text.
- [Chat API][chat-api] is similar to `LLM.chat`, accepting both text and [multi-modal inputs][multimodal-inputs] for models with a chat template.
- [Chat API][chat-api] is similar to `LLM.chat`, accepting both text and [multi-modal inputs](../features/multimodal_inputs.md) for models with a chat template.
---
title: TPU
---
[](){ #tpu-supported-models }
# TPU Supported Models
## Text-only Language Models
......
---
title: Pooling Models
---
[](){ #pooling-models }
vLLM also supports pooling models, including embedding, reranking and reward models.
......@@ -11,7 +10,7 @@ before returning them.
!!! note
We currently support pooling models primarily as a matter of convenience.
As shown in the [Compatibility Matrix][compatibility-matrix], most vLLM features are not applicable to
As shown in the [Compatibility Matrix](../features/compatibility_matrix.md), most vLLM features are not applicable to
pooling models as they only work on the generation or decode stage, so performance may not improve as much.
For pooling models, we support the following `--task` options.
......@@ -113,10 +112,10 @@ A code example can be found here: <gh-file:examples/offline_inference/basic/scor
## Online Serving
Our [OpenAI-Compatible Server][serving-openai-compatible-server] provides endpoints that correspond to the offline APIs:
Our [OpenAI-Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:
- [Pooling API][pooling-api] is similar to `LLM.encode`, being applicable to all types of pooling models.
- [Embeddings API][embeddings-api] is similar to `LLM.embed`, accepting both text and [multi-modal inputs][multimodal-inputs] for embedding models.
- [Embeddings API][embeddings-api] is similar to `LLM.embed`, accepting both text and [multi-modal inputs](../features/multimodal_inputs.md) for embedding models.
- [Classification API][classification-api] is similar to `LLM.classify` and is applicable to sequence classification models.
- [Score API][score-api] is similar to `LLM.score` for cross-encoder models.
......
---
title: Supported Models
---
[](){ #supported-models }
vLLM supports [generative](./generative_models.md) and [pooling](./pooling_models.md) models across various tasks.
If a model supports more than one task, you can set the task via the `--task` argument.
......@@ -34,7 +33,7 @@ llm.apply_model(lambda model: print(type(model)))
If it is `TransformersForCausalLM` then it means it's based on Transformers!
!!! tip
You can force the use of `TransformersForCausalLM` by setting `model_impl="transformers"` for [offline-inference][offline-inference] or `--model-impl transformers` for the [openai-compatible-server][serving-openai-compatible-server].
You can force the use of `TransformersForCausalLM` by setting `model_impl="transformers"` for [offline-inference](../serving/offline_inference.md) or `--model-impl transformers` for the [openai-compatible-server](../serving/openai_compatible_server.md).
!!! note
vLLM may not fully optimise the Transformers implementation so you may see degraded performance if comparing a native model to a Transformers model in vLLM.
......@@ -53,8 +52,8 @@ For a model to be compatible with the Transformers backend for vLLM it must:
If the compatible model is:
- on the Hugging Face Model Hub, simply set `trust_remote_code=True` for [offline-inference][offline-inference] or `--trust-remote-code` for the [openai-compatible-server][serving-openai-compatible-server].
- in a local directory, simply pass directory path to `model=<MODEL_DIR>` for [offline-inference][offline-inference] or `vllm serve <MODEL_DIR>` for the [openai-compatible-server][serving-openai-compatible-server].
- on the Hugging Face Model Hub, simply set `trust_remote_code=True` for [offline-inference](../serving/offline_inference.md) or `--trust-remote-code` for the [openai-compatible-server](../serving/openai_compatible_server.md).
- in a local directory, simply pass directory path to `model=<MODEL_DIR>` for [offline-inference](../serving/offline_inference.md) or `vllm serve <MODEL_DIR>` for the [openai-compatible-server](../serving/openai_compatible_server.md).
This means that, with the Transformers backend for vLLM, new models can be used before they are officially supported in Transformers or vLLM!
......@@ -171,7 +170,7 @@ The [Transformers backend][transformers-backend] enables you to run models direc
If vLLM successfully returns text (for generative models) or hidden states (for pooling models), it indicates that your model is supported.
Otherwise, please refer to [Adding a New Model][new-model] for instructions on how to implement your model in vLLM.
Otherwise, please refer to [Adding a New Model](../contributing/model/README.md) for instructions on how to implement your model in vLLM.
Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) to request vLLM support.
#### Download a model
......@@ -308,13 +307,13 @@ print(output)
### Generative Models
See [this page][generative-models] for more information on how to use generative models.
See [this page](generative_models.md) for more information on how to use generative models.
#### Text Generation
Specified using `--task generate`.
| Architecture | Models | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) |
| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|-------------------|----------------------|---------------------------|---------------------|
| `AquilaForCausalLM` | Aquila, Aquila2 | `BAAI/Aquila-7B`, `BAAI/AquilaChat-7B`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `ArcticForCausalLM` | Arctic | `Snowflake/snowflake-arctic-base`, `Snowflake/snowflake-arctic-instruct`, etc. | | ✅︎ | ✅︎ |
......@@ -412,7 +411,7 @@ See [this page](./pooling_models.md) for more information on how to use pooling
Specified using `--task embed`.
| Architecture | Models | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) |
| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|-------------------|----------------------|---------------------------|---------------------|
| `BertModel` | BERT-based | `BAAI/bge-base-en-v1.5`, `Snowflake/snowflake-arctic-embed-xs`, etc. | | | |
| `Gemma2Model` | Gemma 2-based | `BAAI/bge-multilingual-gemma2`, etc. | ✅︎ | | ✅︎ |
......@@ -448,7 +447,7 @@ of the whole prompt are extracted from the normalized hidden state corresponding
Specified using `--task reward`.
| Architecture | Models | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) |
| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|-------------------|----------------------|---------------------------|---------------------|
| `InternLM2ForRewardModel` | InternLM2-based | `internlm/internlm2-1_8b-reward`, `internlm/internlm2-7b-reward`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `LlamaForCausalLM` | Llama-based | `peiyi9979/math-shepherd-mistral-7b-prm`, etc. | ✅︎ | ✅︎ | ✅︎ |
......@@ -466,7 +465,7 @@ If your model is not in the above list, we will try to automatically convert the
Specified using `--task classify`.
| Architecture | Models | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) |
| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|-------------------|----------------------|---------------------------|---------------------|
| `JambaForSequenceClassification` | Jamba | `ai21labs/Jamba-tiny-reward-dev`, etc. | ✅︎ | ✅︎ | |
| `GPT2ForSequenceClassification` | GPT2 | `nie3e/sentiment-polish-gpt2-small` | | | ✅︎ |
......@@ -527,7 +526,7 @@ On the other hand, modalities separated by `/` are mutually exclusive.
- e.g.: `T / I` means that the model supports text-only and image-only inputs, but not text-with-image inputs.
See [this page][multimodal-inputs] on how to pass multi-modal inputs to the model.
See [this page](../features/multimodal_inputs.md) on how to pass multi-modal inputs to the model.
!!! important
**To enable multiple multi-modal items per text prompt in vLLM V0**, you have to set `limit_mm_per_prompt` (offline inference)
......@@ -557,13 +556,13 @@ See [this page][multimodal-inputs] on how to pass multi-modal inputs to the mode
### Generative Models
See [this page][generative-models] for more information on how to use generative models.
See [this page](generative_models.md) for more information on how to use generative models.
#### Text Generation
Specified using `--task generate`.
| Architecture | Models | Inputs | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) |
| Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|--------|-------------------|----------------------|---------------------------|---------------------|
| `AriaForConditionalGeneration` | Aria | T + I<sup>+</sup> | `rhymes-ai/Aria` | | | ✅︎ |
| `AyaVisionForConditionalGeneration` | Aya Vision | T + I<sup>+</sup> | `CohereForAI/aya-vision-8b`, `CohereForAI/aya-vision-32b`, etc. | | ✅︎ | ✅︎ |
......@@ -685,7 +684,7 @@ Specified using `--task transcription`.
Speech2Text models trained specifically for Automatic Speech Recognition.
| Architecture | Models | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) |
| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|-------------------|----------------------|---------------------------|---------------------|
| `WhisperForConditionalGeneration` | Whisper | `openai/whisper-small`, `openai/whisper-large-v3-turbo`, etc. | | | |
......@@ -708,7 +707,7 @@ Any text generation model can be converted into an embedding model by passing `-
The following table lists those that are tested in vLLM.
| Architecture | Models | Inputs | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) |
| Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|--------|-------------------|----------------------|---------------------------|---------------------|
| `LlavaNextForConditionalGeneration` | LLaVA-NeXT-based | T / I | `royokong/e5-v` | | | |
| `Phi3VForCausalLM` | Phi-3-Vision-based | T + I | `TIGER-Lab/VLM2Vec-Full` | 🚧 | ✅︎ | |
......
---
title: Distributed Inference and Serving
---
[](){ #distributed-serving }
## How to decide the distributed inference strategy?
......
---
title: LangChain
---
[](){ #serving-langchain }
vLLM is also available via [LangChain](https://github.com/langchain-ai/langchain).
......
---
title: LlamaIndex
---
[](){ #serving-llamaindex }
vLLM is also available via [LlamaIndex](https://github.com/run-llama/llama_index).
......