Migrate docs from Sphinx to MkDocs (#18145)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

Commit a1fe24d9 · authored May 23, 2025 by Harry Mellor · committed by GitHub May 23, 2025
20 changed files
--- a/docs/source/features/lora.md
+++ b/docs/source/features/lora.md
-(lora-adapter)=
+---
+title: LoRA Adapters
-# LoRA Adapters
+---
+[](){ #lora-adapter }
 This document shows you how to use [LoRA adapters](https://arxiv.org/abs/2106.09685) with vLLM on top of a base model.
-LoRA adapters can be used with any vLLM model that implements {class}`~vllm.model_executor.models.interfaces.SupportsLoRA`.
+LoRA adapters can be used with any vLLM model that implements [SupportsLoRA][vllm.model_executor.models.interfaces.SupportsLoRA].
 Adapters can be efficiently served on a per request basis with minimal overhead. First we download the adapter(s) and save
 them locally with
@@ -60,9 +61,8 @@ vllm serve meta-llama/Llama-2-7b-hf \
    --lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/
 ```
-:::{note}
+!!! note
-The commit ID `0dfa347e8877a4d4ed19ee56c140fa518470028c` may change over time. Please check the latest commit ID in your environment to ensure you are using the correct one.
+    The commit ID `0dfa347e8877a4d4ed19ee56c140fa518470028c` may change over time. Please check the latest commit ID in your environment to ensure you are using the correct one.
-:::
 The server entrypoint accepts all other LoRA configuration parameters (`max_loras`, `max_lora_rank`, `max_cpu_loras`,
 etc.), which will apply to all forthcoming requests. Upon querying the `/models` endpoint, we should see our LoRA along

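The admonition change in the hunk above (a MyST `:::{note}` … `:::` block becoming an indented `!!! note` block) is mechanical and repeats in most files of this commit. A minimal Python sketch of that rewrite follows. The helper name `convert_admonitions` is hypothetical and not part of this commit, and the sketch assumes un-nested blocks only. The `important` to `warning` mapping mirrors the choice visible later in this diff; all other directive names are assumed to pass through unchanged.

```python
import re

# MyST directive name -> MkDocs admonition type. Only "important" is remapped,
# mirroring this diff; passing other names through is an assumption.
KIND_MAP = {"important": "warning"}

# Matches an opening line such as ":::{note}".
OPEN_RE = re.compile(r"^:::\{(\w+)\}\s*$")


def convert_admonitions(text: str) -> str:
    """Rewrite un-nested MyST admonitions as MkDocs-style admonitions."""
    out = []
    inside = False
    for line in text.splitlines():
        match = OPEN_RE.match(line)
        if not inside and match:
            kind = match.group(1)
            out.append(f"!!! {KIND_MAP.get(kind, kind)}")
            inside = True
        elif inside and line.strip() == ":::":
            # The closing fence has no MkDocs equivalent; indentation ends the block.
            inside = False
        elif inside:
            # Body lines (including nested code fences) are indented by four spaces.
            out.append(f"    {line}" if line else "")
        else:
            out.append(line)
    return "\n".join(out)


print(convert_admonitions(":::{note}\nThe commit ID may change over time.\n:::"))
```

Because the body is indented as a whole, a fenced `console` block inside a note ends up indented as well, which matches the shape of the converted timeout notes in this commit.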
--- a/docs/source/features/multimodal_inputs.md
+++ b/docs/source/features/multimodal_inputs.md
-(multimodal-inputs)=
+---
+title: Multimodal Inputs
+---
+[](){ #multimodal-inputs }
-# Multimodal Inputs
+This page teaches you how to pass multi-modal inputs to [multi-modal models][supported-mm-models] in vLLM.
-This page teaches you how to pass multi-modal inputs to [multi-modal models](#supported-mm-models) in vLLM.
+!!! note
+    We are actively iterating on multi-modal support. See [this RFC](gh-issue:4194) for upcoming changes,
-:::{note}
+    and [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) if you have any feedback or feature requests.
-We are actively iterating on multi-modal support. See [this RFC](gh-issue:4194) for upcoming changes,
-and [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) if you have any feedback or feature requests.
-:::
 ## Offline Inference
-To input multi-modal data, follow this schema in {class}`vllm.inputs.PromptType`:
+To input multi-modal data, follow this schema in [vllm.inputs.PromptType][]:
 - `prompt`: The prompt should follow the format that is documented on HuggingFace.
- `multi_modal_data`: This is a dictionary that follows the schema defined in {class}`vllm.multimodal.inputs.MultiModalDataDict`.
+- `multi_modal_data`: This is a dictionary that follows the schema defined in [vllm.multimodal.inputs.MultiModalDataDict][].
 ### Image Inputs
@@ -211,16 +211,15 @@ for o in outputs:
 Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions API](https://platform.openai.com/docs/api-reference/chat).
-:::{important}
+!!! warning
-A chat template is **required** to use Chat Completions API.
+    A chat template is **required** to use Chat Completions API.
-For HF format models, the default chat template is defined inside `chat_template.json` or `tokenizer_config.json`.
+    For HF format models, the default chat template is defined inside `chat_template.json` or `tokenizer_config.json`.
-If no default chat template is available, we will first look for a built-in fallback in <gh-file:vllm/transformers_utils/chat_templates/registry.py>.
+    If no default chat template is available, we will first look for a built-in fallback in <gh-file:vllm/transformers_utils/chat_templates/registry.py>.
-If no fallback is available, an error is raised and you have to provide the chat template manually via the `--chat-template` argument.
+    If no fallback is available, an error is raised and you have to provide the chat template manually via the `--chat-template` argument.
-For certain models, we provide alternative chat templates inside <gh-dir:vllm/examples>.
+    For certain models, we provide alternative chat templates inside <gh-dir:vllm/examples>.
-For example, VLM2Vec uses <gh-file:examples/template_vlm2vec.jinja> which is different from the default one for Phi-3-Vision.
+    For example, VLM2Vec uses <gh-file:examples/template_vlm2vec.jinja> which is different from the default one for Phi-3-Vision.
-:::
 ### Image Inputs
@@ -284,25 +283,21 @@ print("Chat completion output:", chat_response.choices[0].message.content)
 Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>
-:::{tip}
+!!! tip
-Loading from local file paths is also supported on vLLM: You can specify the allowed local media path via `--allowed-local-media-path` when launching the API server/engine,
+    Loading from local file paths is also supported on vLLM: You can specify the allowed local media path via `--allowed-local-media-path` when launching the API server/engine,
-and pass the file path as `url` in the API request.
+    and pass the file path as `url` in the API request.
-:::
-:::{tip}
-There is no need to place image placeholders in the text content of the API request - they are already represented by the image content.
-In fact, you can place image placeholders in the middle of the text by interleaving text and image content.
-:::
-:::{note}
+!!! tip
-By default, the timeout for fetching images through HTTP URL is `5` seconds.
+    There is no need to place image placeholders in the text content of the API request - they are already represented by the image content.
-You can override this by setting the environment variable:
+    In fact, you can place image placeholders in the middle of the text by interleaving text and image content.
-```console
+!!! note
-export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>
+    By default, the timeout for fetching images through HTTP URL is `5` seconds.
-```
+    You can override this by setting the environment variable:
-:::
+    ```console
+    export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>
+    ```
 ### Video Inputs
@@ -357,15 +352,13 @@ print("Chat completion output from image url:", result)
 Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>
-:::{note}
+!!! note
-By default, the timeout for fetching videos through HTTP URL is `30` seconds.
+    By default, the timeout for fetching videos through HTTP URL is `30` seconds.
-You can override this by setting the environment variable:
+    You can override this by setting the environment variable:
-```console
+    ```console
-export VLLM_VIDEO_FETCH_TIMEOUT=<timeout>
+    export VLLM_VIDEO_FETCH_TIMEOUT=<timeout>
-```
+    ```
-:::
 ### Audio Inputs
@@ -461,15 +454,13 @@ print("Chat completion output from audio url:", result)
 Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>
-:::{note}
+!!! note
-By default, the timeout for fetching audios through HTTP URL is `10` seconds.
+    By default, the timeout for fetching audios through HTTP URL is `10` seconds.
-You can override this by setting the environment variable:
+    You can override this by setting the environment variable:
-```console
-export VLLM_AUDIO_FETCH_TIMEOUT=<timeout>
-```
-:::
+    ```console
+    export VLLM_AUDIO_FETCH_TIMEOUT=<timeout>
+    ```
 ### Embedding Inputs
@@ -535,7 +526,6 @@ chat_completion = client.chat.completions.create(
 )
 ```
-:::{note}
+!!! note
-Only one message can contain `{"type": "image_embeds"}`.
+    Only one message can contain `{"type": "image_embeds"}`.
-If used with a model that requires additional parameters, you must also provide a tensor for each of them, e.g. `image_grid_thw`, `image_sizes`, etc.
+    If used with a model that requires additional parameters, you must also provide a tensor for each of them, e.g. `image_grid_thw`, `image_sizes`, etc.
-:::
--- a/docs/source/features/prompt_embeds.md
+++ b/docs/source/features/prompt_embeds.md
@@ -6,13 +6,12 @@ This page teaches you how to pass prompt embedding inputs to vLLM.
 The traditional flow of text data for a Large Language Model goes from text to token ids (via a tokenizer) then from token ids to prompt embeddings. For a traditional decoder-only model (such as meta-llama/Llama-3.1-8B-Instruct), this step of converting token ids to prompt embeddings happens via a look-up from a learned embedding matrix, but the model is not limited to processing only the embeddings corresponding to its token vocabulary.
-:::{note}
+!!! note
-Prompt embeddings are currently only supported in the v0 engine.
+    Prompt embeddings are currently only supported in the v0 engine.
-:::
 ## Offline Inference
-To input multi-modal data, follow this schema in {class}`vllm.inputs.EmbedsPrompt`:
+To input prompt embeddings, follow this schema in [vllm.inputs.EmbedsPrompt][]:
 - `prompt_embeds`: A torch tensor representing a sequence of prompt/token embeddings. This has the shape (sequence_length, hidden_size), where sequence_length is the number of token embeddings and hidden_size is the hidden size (embedding size) of the model.

--- a/docs/features/quantization/README.md
+++ b/docs/features/quantization/README.md
+---
+title: Quantization
+---
+[](){ #quantization-index }
+Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices.
+Contents:
+- [Supported_Hardware](supported_hardware.md)
+- [Auto_Awq](auto_awq.md)
+- [Bnb](bnb.md)
+- [Bitblas](bitblas.md)
+- [Gguf](gguf.md)
+- [Gptqmodel](gptqmodel.md)
+- [Int4](int4.md)
+- [Int8](int8.md)
+- [Fp8](fp8.md)
+- [Modelopt](modelopt.md)
+- [Quark](quark.md)
+- [Quantized_Kvcache](quantized_kvcache.md)
+- [Torchao](torchao.md)
--- a/docs/source/features/quantization/auto_awq.md
+++ b/docs/source/features/quantization/auto_awq.md
-(auto-awq)=
+---
+title: AutoAWQ
-# AutoAWQ
+---
+[](){ #auto-awq }
 To create a new 4-bit quantized model, you can leverage [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
 Quantization reduces the model's precision from BF16/FP16 to INT4 which effectively reduces the total model memory footprint.

--- a/docs/source/features/quantization/bitblas.md
+++ b/docs/source/features/quantization/bitblas.md
-(bitblas)=
+---
+title: BitBLAS
-# BitBLAS
+---
+[](){ #bitblas }
 vLLM now supports [BitBLAS](https://github.com/microsoft/BitBLAS) for more efficient and flexible model inference. Compared to other quantization frameworks, BitBLAS provides more precision combinations.
-:::{note}
+!!! note
-Ensure your hardware supports the selected `dtype` (`torch.bfloat16` or `torch.float16`).
+    Ensure your hardware supports the selected `dtype` (`torch.bfloat16` or `torch.float16`).
-Most recent NVIDIA GPUs support `float16`, while `bfloat16` is more common on newer architectures like Ampere or Hopper.
+    Most recent NVIDIA GPUs support `float16`, while `bfloat16` is more common on newer architectures like Ampere or Hopper.
-For details see [supported hardware](https://docs.vllm.ai/en/latest/features/quantization/supported_hardware.html).
+    For details see [supported hardware](https://docs.vllm.ai/en/latest/features/quantization/supported_hardware.html).
-:::
 Below are the steps to utilize BitBLAS with vLLM.

--- a/docs/source/features/quantization/bnb.md
+++ b/docs/source/features/quantization/bnb.md
-(bits-and-bytes)=
+---
+title: BitsAndBytes
-# BitsAndBytes
+---
+[](){ #bits-and-bytes }
 vLLM now supports [BitsAndBytes](https://github.com/TimDettmers/bitsandbytes) for more efficient model inference.
 BitsAndBytes quantizes models to reduce memory usage and enhance performance without significantly sacrificing accuracy.

--- a/docs/source/features/quantization/fp8.md
+++ b/docs/source/features/quantization/fp8.md
-(fp8)=
+---
+title: FP8 W8A8
-# FP8 W8A8
+---
+[](){ #fp8 }
 vLLM supports FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on GPUs such as Nvidia H100 and AMD MI300x.
 Currently, only Hopper and Ada Lovelace GPUs are officially supported for W8A8.
@@ -14,10 +15,9 @@ The FP8 types typically supported in hardware have two distinct representations,
 - **E4M3**: Consists of 1 sign bit, 4 exponent bits, and 3 bits of mantissa. It can store values up to +/-448 and `nan`.
 - **E5M2**: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- `inf`, and `nan`. The tradeoff for the increased dynamic range is lower precision of the stored values.
-:::{note}
+!!! note
-FP8 computation is supported on NVIDIA GPUs with compute capability > 8.9 (Ada Lovelace, Hopper).
+    FP8 computation is supported on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper).
-FP8 models will run on compute capability > 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin.
+    FP8 models will run on compute capability >= 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin.
-:::
 ## Installation
@@ -94,9 +94,8 @@ print(result[0].outputs[0].text)
 Evaluate accuracy with `lm_eval` (for example on 250 samples of `gsm8k`):
-:::{note}
+!!! note
-Quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations.
+    Quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations.
-:::
 ```console
 $ MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic
@@ -133,6 +132,5 @@ result = model.generate("Hello, my name is")
 print(result[0].outputs[0].text)
 ```
-:::{warning}
+!!! warning
-Currently, we load the model at original precision before quantizing down to 8-bits, so you need enough memory to load the whole model.
+    Currently, we load the model at original precision before quantizing down to 8-bits, so you need enough memory to load the whole model.
-:::
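The E4M3 and E5M2 limits quoted in this file (+/-448 and +/-57344) can be checked with a short calculation. The sketch below assumes the common FP8 encodings: E5M2 reserves its top exponent code for `inf`/`nan` in the IEEE style, while E4M3 keeps the top exponent usable and reserves only the all-ones mantissa there for `nan`. The function name `fp8_max` is illustrative, not a vLLM API.

```python
def fp8_max(exp_bits: int, man_bits: int, ieee_like: bool) -> float:
    """Largest finite value of an FP8 format under the stated encoding assumptions."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # Top exponent code is reserved for inf/nan, so use the one below it
        # with every mantissa bit set.
        max_exp_code = 2 ** exp_bits - 2
        max_mantissa = 2 ** man_bits - 1
    else:
        # Top exponent code is usable; only the all-ones mantissa is nan,
        # so the largest finite mantissa is one step below all-ones.
        max_exp_code = 2 ** exp_bits - 1
        max_mantissa = 2 ** man_bits - 2
    significand = 1 + max_mantissa / 2 ** man_bits
    return significand * 2.0 ** (max_exp_code - bias)


print(fp8_max(4, 3, ieee_like=False))  # E4M3
print(fp8_max(5, 2, ieee_like=True))   # E5M2
```

Both formats reach a significand of 1.75 at their largest finite value; the extra exponent bit in E5M2 is what buys the wider range at the cost of one mantissa bit.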
--- a/docs/source/features/quantization/gguf.md
+++ b/docs/source/features/quantization/gguf.md
-(gguf)=
+---
+title: GGUF
+---
+[](){ #gguf }
-# GGUF
+!!! warning
+    Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment, and it might be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team.
-:::{warning}
+!!! warning
-Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment, it might be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team.
+    Currently, vLLM only supports loading single-file GGUF models. If you have a multi-file GGUF model, you can use the [gguf-split](https://github.com/ggerganov/llama.cpp/pull/6135) tool to merge it into a single-file model.
-:::
-:::{warning}
-Currently, vllm only supports loading single-file GGUF models. If you have a multi-files GGUF model, you can use [gguf-split](https://github.com/ggerganov/llama.cpp/pull/6135) tool to merge them to a single-file model.
-:::
 To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command:
@@ -25,9 +24,8 @@ You can also add `--tensor-parallel-size 2` to enable tensor parallelism inferen
 vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2
 ```
-:::{warning}
+!!! warning
-We recommend using the tokenizer from base model instead of GGUF model. Because the tokenizer conversion from GGUF is time-consuming and unstable, especially for some models with large vocab size.
+    We recommend using the tokenizer from the base model instead of the GGUF model, because the tokenizer conversion from GGUF is time-consuming and unstable, especially for models with a large vocabulary.
-:::
 GGUF assumes that Hugging Face can convert the metadata to a config file. If Hugging Face doesn't support your model, you can manually create a config and pass it as hf-config-path.

--- a/docs/source/features/quantization/gptqmodel.md
+++ b/docs/source/features/quantization/gptqmodel.md
-(gptqmodel)=
+---
+title: GPTQModel
-# GPTQModel
+---
+[](){ #gptqmodel }
 To create a new 4-bit or 8-bit GPTQ quantized model, you can leverage [GPTQModel](https://github.com/ModelCloud/GPTQModel) from ModelCloud.AI.

--- a/docs/source/features/quantization/int4.md
+++ b/docs/source/features/quantization/int4.md
-(int4)=
+---
+title: INT4 W4A16
-# INT4 W4A16
+---
+[](){ #int4 }
 vLLM supports quantizing weights to INT4 for memory savings and inference acceleration. This quantization method is particularly useful for reducing model size and maintaining low latency in workloads with low queries per second (QPS).
 Please visit the HF collection of [quantized INT4 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/int4-llms-for-vllm-668ec34bf3c9fa45f857df2c).
-:::{note}
+!!! note
-INT4 computation is supported on NVIDIA GPUs with compute capability > 8.0 (Ampere, Ada Lovelace, Hopper, Blackwell).
+    INT4 computation is supported on NVIDIA GPUs with compute capability >= 8.0 (Ampere, Ada Lovelace, Hopper, Blackwell).
-:::
 ## Prerequisites
@@ -121,9 +121,8 @@ $ lm_eval --model vllm \
  --batch_size 'auto'
 ```
-:::{note}
+!!! note
-Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations.
+    Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations.
-:::
 ## Best Practices

--- a/docs/source/features/quantization/int8.md
+++ b/docs/source/features/quantization/int8.md
-(int8)=
+---
+title: INT8 W8A8
-# INT8 W8A8
+---
+[](){ #int8 }
 vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration.
 This quantization method is particularly useful for reducing model size while maintaining good performance.
 Please visit the HF collection of [quantized INT8 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/int8-llms-for-vllm-668ec32c049dca0369816415).
-:::{note}
+!!! note
-INT8 computation is supported on NVIDIA GPUs with compute capability > 7.5 (Turing, Ampere, Ada Lovelace, Hopper, Blackwell).
+    INT8 computation is supported on NVIDIA GPUs with compute capability >= 7.5 (Turing, Ampere, Ada Lovelace, Hopper, Blackwell).
-:::
 ## Prerequisites
@@ -125,9 +125,8 @@ $ lm_eval --model vllm \
  --batch_size 'auto'
 ```
-:::{note}
+!!! note
-Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations.
+    Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations.
-:::
 ## Best Practices

--- a/docs/source/features/quantization/modelopt.md
+++ b/docs/source/features/quantization/modelopt.md
--- a/docs/source/features/quantization/quantized_kvcache.md
+++ b/docs/source/features/quantization/quantized_kvcache.md
-(quantized-kvcache)=
+---
+title: Quantized KV Cache
-# Quantized KV Cache
+---
+[](){ #quantized-kvcache }
 ## FP8 KV Cache

--- a/docs/source/features/quantization/quark.md
+++ b/docs/source/features/quantization/quark.md
-(quark)=
+---
+title: AMD QUARK
-# AMD QUARK
+---
+[](){ #quark }
 Quantization can effectively reduce memory and bandwidth usage, accelerate computation and improve
 throughput with minimal accuracy loss. vLLM can leverage [Quark](https://quark.docs.amd.com/latest/),
@@ -86,13 +87,12 @@ We need to set the quantization configuration, you can check
 for further details. Here we use FP8 per-tensor quantization on weight, activation,
 kv-cache and the quantization algorithm is AutoSmoothQuant.
-:::{note}
+!!! note
-Note the quantization algorithm needs a JSON config file and the config file is located in
+    Note the quantization algorithm needs a JSON config file and the config file is located in
-[Quark Pytorch examples](https://quark.docs.amd.com/latest/pytorch/pytorch_examples.html),
+    [Quark Pytorch examples](https://quark.docs.amd.com/latest/pytorch/pytorch_examples.html),
-under the directory `examples/torch/language_modeling/llm_ptq/models`. For example,
+    under the directory `examples/torch/language_modeling/llm_ptq/models`. For example,
-AutoSmoothQuant config file for Llama is
+    AutoSmoothQuant config file for Llama is
-`examples/torch/language_modeling/llm_ptq/models/llama/autosmoothquant_config.json`.
+    `examples/torch/language_modeling/llm_ptq/models/llama/autosmoothquant_config.json`.
-:::
 ```python
 from quark.torch.quantization import (Config, QuantizationConfig,

--- a/docs/features/quantization/supported_hardware.md
+++ b/docs/features/quantization/supported_hardware.md
+---
+title: Supported Hardware
+---
+[](){ #quantization-supported-hardware }
+The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
+| Implementation        | Volta   | Turing   | Ampere   | Ada   | Hopper   | AMD GPU   | Intel GPU   | x86 CPU   | AWS Inferentia   | Google TPU   |
+|-----------------------|---------|----------|----------|-------|----------|-----------|-------------|-----------|------------------|--------------|
+| AWQ                   | ❌       | ✅︎       | ✅︎       | ✅︎    | ✅︎       | ❌         | ✅︎          | ✅︎        | ❌                | ❌            |
+| GPTQ                  | ✅︎      | ✅︎       | ✅︎       | ✅︎    | ✅︎       | ❌         | ✅︎          | ✅︎        | ❌                | ❌            |
+| Marlin (GPTQ/AWQ/FP8) | ❌       | ❌        | ✅︎       | ✅︎    | ✅︎       | ❌         | ❌           | ❌         | ❌                | ❌            |
+| INT8 (W8A8)           | ❌       | ✅︎       | ✅︎       | ✅︎    | ✅︎       | ❌         | ❌           | ✅︎        | ❌                | ✅︎           |
+| FP8 (W8A8)            | ❌       | ❌        | ❌        | ✅︎    | ✅︎       | ✅︎        | ❌           | ❌         | ❌                | ❌            |
+| BitBLAS (GPTQ)        | ✅︎      | ✅︎       | ✅︎       | ✅︎    | ✅︎       | ❌         | ❌           | ❌         | ❌                | ❌            |
+| AQLM                  | ✅︎      | ✅︎       | ✅︎       | ✅︎    | ✅︎       | ❌         | ❌           | ❌         | ❌                | ❌            |
+| bitsandbytes          | ✅︎      | ✅︎       | ✅︎       | ✅︎    | ✅︎       | ❌         | ❌           | ❌         | ❌                | ❌            |
+| DeepSpeedFP           | ✅︎      | ✅︎       | ✅︎       | ✅︎    | ✅︎       | ❌         | ❌           | ❌         | ❌                | ❌            |
+| GGUF                  | ✅︎      | ✅︎       | ✅︎       | ✅︎    | ✅︎       | ✅︎        | ❌           | ❌         | ❌                | ❌            |
+- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
+- ✅︎ indicates that the quantization method is supported on the specified hardware.
+- ❌ indicates that the quantization method is not supported on the specified hardware.
+!!! note
+    This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
+    For the most up-to-date information on hardware support and quantization methods, please refer to <gh-dir:vllm/model_executor/layers/quantization> or consult with the vLLM development team.
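The table and its SM footnote can be read as a two-step lookup: compute capability to architecture, then architecture to support. The sketch below encodes only the NVIDIA columns for a handful of rows, copied from the table above. The names `ARCH_BY_SM`, `SUPPORTED_ARCHS`, and `is_supported` are illustrative and not part of vLLM.

```python
# NVIDIA compute capability -> architecture name, per the footnote under the table.
ARCH_BY_SM = {
    (7, 0): "Volta",
    (7, 5): "Turing",
    (8, 0): "Ampere",
    (8, 6): "Ampere",
    (8, 9): "Ada",
    (9, 0): "Hopper",
}

# A few rows of the table, NVIDIA columns only.
SUPPORTED_ARCHS = {
    "AWQ": {"Turing", "Ampere", "Ada", "Hopper"},
    "GPTQ": {"Volta", "Turing", "Ampere", "Ada", "Hopper"},
    "Marlin (GPTQ/AWQ/FP8)": {"Ampere", "Ada", "Hopper"},
    "INT8 (W8A8)": {"Turing", "Ampere", "Ada", "Hopper"},
    "FP8 (W8A8)": {"Ada", "Hopper"},
}


def is_supported(implementation: str, compute_capability: tuple) -> bool:
    """Return True if the table marks the implementation as supported on that SM."""
    arch = ARCH_BY_SM.get(compute_capability)
    return arch is not None and arch in SUPPORTED_ARCHS[implementation]


print(is_supported("FP8 (W8A8)", (8, 9)))
print(is_supported("FP8 (W8A8)", (8, 0)))
```

On a machine with PyTorch and a CUDA device, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple of the form used as the key here.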
--- a/docs/source/features/quantization/torchao.md
+++ b/docs/source/features/quantization/torchao.md
--- a/docs/source/features/reasoning_outputs.md
+++ b/docs/source/features/reasoning_outputs.md
-(reasoning-outputs)=
+---
+title: Reasoning Outputs
-# Reasoning Outputs
+---
+[](){ #reasoning-outputs }
 vLLM offers support for reasoning models like [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1), which are designed to generate outputs containing both reasoning steps and final conclusions.
@@ -17,10 +18,9 @@ vLLM currently supports the following reasoning models:
 | [IBM Granite 3.2 language models](https://huggingface.co/collections/ibm-granite/granite-32-language-models-67b3bc8c13508f6d064cff9a) | `granite` | ❌ | ❌ |
 | [Qwen3 series](https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f) | `qwen3` | `guided_json`, `guided_regex` | ✅ |
-:::{note}
+!!! note
-IBM Granite 3.2 reasoning is disabled by default; to enable it, you must also pass `thinking=True` in your `chat_template_kwargs`.
+    IBM Granite 3.2 reasoning is disabled by default; to enable it, you must also pass `thinking=True` in your `chat_template_kwargs`.
-The reasoning feature for the Qwen3 series is enabled by default. To disable it, you must pass `enable_thinking=False` in your `chat_template_kwargs`.
+    The reasoning feature for the Qwen3 series is enabled by default. To disable it, you must pass `enable_thinking=False` in your `chat_template_kwargs`.
-:::
 ## Quickstart
@@ -167,12 +167,10 @@ client = OpenAI(
 models = client.models.list()
 model = models.data[0].id
 class People(BaseModel):
    name: str
    age: int
 json_schema = People.model_json_schema()
 prompt = ("Generate a JSON with the name and age of one random person.")

--- a/docs/source/features/spec_decode.md
+++ b/docs/source/features/spec_decode.md
-(spec-decode)=
+---
+title: Speculative Decoding
+---
+[](){ #spec-decode }
-# Speculative Decoding
+!!! warning
+    Please note that speculative decoding in vLLM is not yet optimized and does
+    not yet yield inter-token latency reductions for every prompt dataset or set of sampling parameters.
+    The work to optimize it is ongoing and can be followed here: <gh-issue:4630>
-:::{warning}
+!!! warning
-Please note that speculative decoding in vLLM is not yet optimized and does
+    Currently, speculative decoding in vLLM is not compatible with pipeline parallelism.
-not usually yield inter-token latency reductions for all prompt datasets or sampling parameters.
-The work to optimize it is ongoing and can be followed here: <gh-issue:4630>
-:::
-:::{warning}
-Currently, speculative decoding in vLLM is not compatible with pipeline parallelism.
-:::
 This document shows how to use [Speculative Decoding](https://x.com/karpathy/status/1697318534555336961) with vLLM.
 Speculative decoding is a technique which improves inter-token latency in memory-bound LLM inference.
@@ -51,9 +50,8 @@ python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model
    --speculative_config '{"model": "facebook/opt-125m", "num_speculative_tokens": 5}'
 ```
-:::{warning}
+!!! warning
-Note: Please use `--speculative_config` to set all configurations related to speculative decoding. The previous method of specifying the model through `--speculative_model` and adding related parameters (e.g., `--num_speculative_tokens`) separately has been deprecated now.
+    Please use `--speculative_config` to set all configurations related to speculative decoding. The previous method of specifying the model through `--speculative_model` and adding related parameters (e.g., `--num_speculative_tokens`) separately has been deprecated.
-:::
 Then use a client:
@@ -255,7 +253,7 @@ speculative decoding, breaking down the guarantees into three key areas:
 3. **vLLM Logprob Stability**
   \- vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the
   same request across runs. For more details, see the FAQ section
-   titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](#faq).
+   titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs][faq].
 While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding
 can occur due to following factors:
@@ -264,7 +262,7 @@ can occur due to following factors:
 - **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially
  due to non-deterministic behavior in batched operations or numerical instability.
-For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](#faq).
+For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs][faq].
 ## Resources for vLLM contributors

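Besides admonitions, this commit rewrites cross-references in three recurring ways: `{class}` roles with a leading `~` become `[Name][dotted.path]`, `{class}` roles without it become `[dotted.path][]`, and in-site anchor links such as `[FAQs](#faq)` become `[FAQs][faq]`. A minimal regex sketch of those three rewrites is shown below. The function `convert_refs` is hypothetical, is not the tooling used for this migration, and ignores edge cases such as roles other than `{class}`.

```python
import re


def convert_refs(text: str) -> str:
    """Rewrite Sphinx/MyST cross-references into reference-style links."""

    def short_name(match: re.Match) -> str:
        # {class}`~pkg.mod.Name` -> [Name][pkg.mod.Name]
        path = match.group(1)
        return f"[{path.rsplit('.', 1)[-1]}][{path}]"

    text = re.sub(r"\{class\}`~([\w.]+)`", short_name, text)
    # {class}`pkg.mod.Name` -> [pkg.mod.Name][]
    text = re.sub(r"\{class\}`([\w.]+)`", r"[\1][]", text)
    # [label](#anchor) -> [label][anchor]
    text = re.sub(r"\[([^\]]+)\]\(#([\w-]+)\)", r"[\1][\2]", text)
    return text


print(convert_refs("see the [FAQs](#faq) and {class}`vllm.inputs.PromptType`"))
```

External links are left untouched because the third pattern only matches targets that start with `#`.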
--- a/docs/source/features/structured_outputs.md
+++ b/docs/source/features/structured_outputs.md
-(structured-outputs)=
+---
+title: Structured Outputs
-# Structured Outputs
+---
+[](){ #structured-outputs }
 vLLM supports the generation of structured outputs using
 [xgrammar](https://github.com/mlc-ai/xgrammar) or
@@ -20,7 +21,7 @@ The following parameters are supported, which must be added as extra parameters:
 - `guided_grammar`: the output will follow the context free grammar.
 - `structural_tag`: Follow a JSON schema within a set of specified tags within the generated text.
-You can see the complete list of supported parameters on the [OpenAI-Compatible Server](#openai-compatible-server) page.
+You can see the complete list of supported parameters on the [OpenAI-Compatible Server][openai-compatible-server] page.
 Structured outputs are supported by default in the OpenAI-Compatible Server. You
 may choose to specify the backend to use by setting the
@@ -83,13 +84,11 @@ class CarType(str, Enum):
    truck = "Truck"
    coupe = "Coupe"
 class CarDescription(BaseModel):
    brand: str
    model: str
    car_type: CarType
 json_schema = CarDescription.model_json_schema()
 completion = client.chat.completions.create(
@@ -105,11 +104,10 @@ completion = client.chat.completions.create(
 print(completion.choices[0].message.content)
 ```
-:::{tip}
+!!! tip
-While not strictly necessary, normally it´s better to indicate in the prompt the
+    While not strictly necessary, normally it's better to indicate in the prompt the
-JSON schema and how the fields should be populated.  This can improve the
+    JSON schema and how the fields should be populated.  This can improve the
-results notably in most cases.
+    results notably in most cases.
-:::
 Finally we have the `guided_grammar` option, which is probably the most
 difficult to use, but it's really powerful. It allows us to define complete
@@ -160,12 +158,10 @@ Here is a simple example demonstrating how to get structured output using Pydant
 from pydantic import BaseModel
 from openai import OpenAI
 class Info(BaseModel):
    name: str
    age: int
 client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")
 completion = client.beta.chat.completions.parse(
    model="meta-llama/Llama-3.1-8B-Instruct",
@@ -199,17 +195,14 @@ from typing import List
 from pydantic import BaseModel
 from openai import OpenAI
 class Step(BaseModel):
    explanation: str
    output: str
 class MathResponse(BaseModel):
    steps: list[Step]
    final_answer: str
 client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")
 completion = client.beta.chat.completions.parse(
    model="meta-llama/Llama-3.1-8B-Instruct",