[Doc] Convert docs to use colon fences (#12471)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

Unverified Commit dd6a3a02 authored Jan 29, 2025 by Harry Mellor Committed by GitHub Jan 29, 2025
20 changed files
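The fence change in the hunks below is mechanical, so it can be sketched in a few lines of Python. This is an illustration only: the commit does not say how the conversion was done, `to_colon_fences` is a hypothetical helper name, and the sketch assumes directives are not nested inside other directives.

```python
import re

# Opening fence of a directive: optional indent or blockquote prefix,
# three or more backticks, then "{name}" and any arguments.
_OPEN = re.compile(r"^(?P<prefix>[ >]*)(?P<ticks>`{3,})(?P<rest>\{[\w-]+\}.*)$")
# Bare fence: the same prefix followed only by backticks.
_CLOSE = re.compile(r"^(?P<prefix>[ >]*)(?P<ticks>`{3,})\s*$")


def to_colon_fences(text: str) -> str:
    """Rewrite backtick-fenced directives as colon-fenced ones.

    Ordinary code blocks (```python, ```cpp, ...) are left untouched,
    including ones nested inside a directive with a longer fence.
    """
    out = []
    open_ticks = None  # backtick run that opened the current directive
    for line in text.split("\n"):
        if open_ticks is None:
            m = _OPEN.match(line)
            if m:
                open_ticks = m.group("ticks")
                line = f"{m.group('prefix')}:::{m.group('rest')}"
        else:
            m = _CLOSE.match(line)
            # Only a fence of the same length closes the directive.
            if m and m.group("ticks") == open_ticks:
                open_ticks = None
                line = f"{m.group('prefix')}:::"
        out.append(line)
    return "\n".join(out)
```

Tracking the length of the opening fence is what lets a four-backtick `{note}` that wraps a three-backtick `python` block be converted without touching the inner block, as in the `arch_overview.md` hunk.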
--- a/docs/source/deployment/nginx.md
+++ b/docs/source/deployment/nginx.md
@@ -105,9 +105,9 @@ docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-si
 docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8082:8000 --name vllm1 vllm --model meta-llama/Llama-2-7b-chat-hf
 ```
-```{note}
+:::{note}
 If you are behind a proxy, you can pass the proxy settings to the `docker run` command via `-e http_proxy=$http_proxy -e https_proxy=$https_proxy`.
-```
+:::
 (nginxloadbalancer-nginx-launch-nginx)=

--- a/docs/source/design/arch_overview.md
+++ b/docs/source/design/arch_overview.md
@@ -4,19 +4,19 @@
 This document provides an overview of the vLLM architecture.
-```{contents} Table of Contents
+:::{contents} Table of Contents
 :depth: 2
 :local: true
-```
+:::
 ## Entrypoints
 vLLM provides a number of entrypoints for interacting with the system. The
 following diagram shows the relationship between them.
-```{image} /assets/design/arch_overview/entrypoints.excalidraw.png
+:::{image} /assets/design/arch_overview/entrypoints.excalidraw.png
 :alt: Entrypoints Diagram
-```
+:::
 ### LLM Class
@@ -84,9 +84,9 @@ More details on the API server can be found in the [OpenAI-Compatible Server](#o
 The `LLMEngine` and `AsyncLLMEngine` classes are central to the functioning of
 the vLLM system, handling model inference and asynchronous request processing.
-```{image} /assets/design/arch_overview/llm_engine.excalidraw.png
+:::{image} /assets/design/arch_overview/llm_engine.excalidraw.png
 :alt: LLMEngine Diagram
-```
+:::
 ### LLMEngine
@@ -144,11 +144,11 @@ configurations affect the class we ultimately get.
 The following figure shows the class hierarchy of vLLM:
-> ```{figure} /assets/design/hierarchy.png
+> :::{figure} /assets/design/hierarchy.png
 > :align: center
 > :alt: query
 > :width: 100%
-> ```
+> :::
 There are several important design choices behind this class hierarchy:
@@ -178,7 +178,7 @@ of a vision model and a language model. By making the constructor uniform, we
 can easily create a vision model and a language model and compose them into a
 vision-language model.
-````{note}
+:::{note}
 To support this change, all vLLM models' signatures have been updated to:
 ```python
@@ -215,7 +215,7 @@ else:
 ```
 This way, the model can work with both old and new versions of vLLM.
-````
+:::
 3\. **Sharding and Quantization at Initialization**: Certain features require
 changing the model weights. For example, tensor parallelism needs to shard the

--- a/docs/source/design/kernel/paged_attention.md
+++ b/docs/source/design/kernel/paged_attention.md
@@ -139,26 +139,26 @@
  const scalar_t* q_ptr = q + seq_idx * q_stride + head_idx * HEAD_SIZE;
  ```
-  ```{figure} ../../assets/kernel/query.png
+  :::{figure} ../../assets/kernel/query.png
  :align: center
  :alt: query
  :width: 70%
  Query data of one token at one head
-  ```
+  :::
 - Each thread defines its own `q_ptr` which points to the assigned
  query token data on global memory. For example, if `VEC_SIZE` is 4
  and `HEAD_SIZE` is 128, the `q_ptr` points to data that contains
  total of 128 elements divided into 128 / 4 = 32 vecs.
-  ```{figure} ../../assets/kernel/q_vecs.png
+  :::{figure} ../../assets/kernel/q_vecs.png
  :align: center
  :alt: q_vecs
  :width: 70%
  `q_vecs` for one thread group
-  ```
+  :::
  ```cpp
  __shared__ Q_vec q_vecs[THREAD_GROUP_SIZE][NUM_VECS_PER_THREAD];
@@ -195,13 +195,13 @@
  points to key token data based on `k_cache` at assigned block,
  assigned head and assigned token.
-  ```{figure} ../../assets/kernel/key.png
+  :::{figure} ../../assets/kernel/key.png
  :align: center
  :alt: key
  :width: 70%
  Key data of all context tokens at one head
-  ```
+  :::
 - The diagram above illustrates the memory layout for key data. It
  assumes that the `BLOCK_SIZE` is 16, `HEAD_SIZE` is 128, `x` is
@@ -214,13 +214,13 @@
  elements for one token) that will be processed by 2 threads (one
  thread group) separately.
-  ```{figure} ../../assets/kernel/k_vecs.png
+  :::{figure} ../../assets/kernel/k_vecs.png
  :align: center
  :alt: k_vecs
  :width: 70%
  `k_vecs` for one thread
-  ```
+  :::
  ```cpp
  K_vec k_vecs[NUM_VECS_PER_THREAD]
@@ -289,14 +289,14 @@
  should be performed across the entire thread block, encompassing
  results between the query token and all context key tokens.
-  ```{math}
+  :::{math}
  :nowrap: true
  \begin{gather*}
  m(x):=\max _i \quad x_i \\ \quad f(x):=\left[\begin{array}{lll}e^{x_1-m(x)} & \ldots & e^{x_B-m(x)}\end{array}\right]\\ \quad \ell(x):=\sum_i f(x)_i \\
  \quad \operatorname{softmax}(x):=\frac{f(x)}{\ell(x)}
  \end{gather*}
-  ```
+  :::
 ### `qk_max` and `logits`
@@ -379,29 +379,29 @@
 ## Value
-```{figure} ../../assets/kernel/value.png
+:::{figure} ../../assets/kernel/value.png
 :align: center
 :alt: value
 :width: 70%
 Value data of all context tokens at one head
-```
+:::
-```{figure} ../../assets/kernel/logits_vec.png
+:::{figure} ../../assets/kernel/logits_vec.png
 :align: center
 :alt: logits_vec
 :width: 50%
 `logits_vec` for one thread
-```
+:::
-```{figure} ../../assets/kernel/v_vec.png
+:::{figure} ../../assets/kernel/v_vec.png
 :align: center
 :alt: v_vec
 :width: 70%
 List of `v_vec` for one thread
-```
+:::
 - Now we need to retrieve the value data and perform dot multiplication
  with `logits`. Unlike query and key, there is no thread group

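The softmax definition in the `{math}` block of the hunk above subtracts the maximum `m(x)` before exponentiating. A minimal plain-Python sketch of that formula follows. It is not the CUDA kernel, which computes the same quantities with reductions across the thread block, and `safe_softmax` is a name chosen here for illustration.

```python
import math


def safe_softmax(x):
    """Softmax computed as f(x) / l(x) with f(x)_i = exp(x_i - m(x))."""
    m = max(x)                           # m(x) := max_i x_i
    f = [math.exp(xi - m) for xi in x]   # f(x): shifted exponentials, all <= 1
    l = sum(f)                           # l(x) := sum_i f(x)_i
    return [fi / l for fi in f]          # softmax(x) := f(x) / l(x)


probs = safe_softmax([1000.0, 1000.0])   # exp(1000.0) alone would overflow
```

Subtracting `m(x)` leaves the result unchanged mathematically but keeps every exponent at or below zero, so large logits do not overflow.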
--- a/docs/source/design/multiprocessing.md
+++ b/docs/source/design/multiprocessing.md
@@ -7,9 +7,9 @@ page for information on known issues and how to solve them.
 ## Introduction
-```{important}
+:::{important}
 The source code references are to the state of the code at the time of writing in December, 2024.
-```
+:::
 The use of Python multiprocessing in vLLM is complicated by:

--- a/docs/source/features/automatic_prefix_caching.md
+++ b/docs/source/features/automatic_prefix_caching.md
@@ -6,9 +6,9 @@
 Automatic Prefix Caching (APC in short) caches the KV cache of existing queries, so that a new query can directly reuse the KV cache if it shares the same prefix with one of the existing queries, allowing the new query to skip the computation of the shared part.
-```{note}
+:::{note}
 Technical details on how vLLM implements APC can be found [here](#design-automatic-prefix-caching).
-```
+:::
 ## Enabling APC in vLLM

--- a/docs/source/features/compatibility_matrix.md
+++ b/docs/source/features/compatibility_matrix.md
@@ -4,13 +4,13 @@
 The tables below show mutually exclusive features and the support on some hardware.
-```{note}
+:::{note}
 Check the '✗' with links to see tracking issue for unsupported feature/hardware combination.
-```
+:::
 ## Feature x Feature
-```{raw} html
+:::{raw} html
 <style>
  /* Make smaller to try to improve readability  */
  td {
@@ -23,448 +23,447 @@ Check the '✗' with links to see tracking issue for unsupported feature/hardwar
    font-size: 0.8rem;
  }
 </style>
-```
+:::
-```{list-table}
+:::{list-table}
-   :header-rows: 1
+:header-rows: 1
-   :stub-columns: 1
+:stub-columns: 1
-   :widths: auto
+:widths: auto
-   * - Feature
+- * Feature
-     - [CP](#chunked-prefill)
+  * [CP](#chunked-prefill)
-     - [APC](#automatic-prefix-caching)
+  * [APC](#automatic-prefix-caching)
-     - [LoRA](#lora-adapter)
+  * [LoRA](#lora-adapter)
-     - <abbr title="Prompt Adapter">prmpt adptr</abbr>
+  * <abbr title="Prompt Adapter">prmpt adptr</abbr>
-     - [SD](#spec_decode)
+  * [SD](#spec_decode)
-     - CUDA graph
+  * CUDA graph
-     - <abbr title="Pooling Models">pooling</abbr>
+  * <abbr title="Pooling Models">pooling</abbr>
-     - <abbr title="Encoder-Decoder Models">enc-dec</abbr>
+  * <abbr title="Encoder-Decoder Models">enc-dec</abbr>
-     - <abbr title="Logprobs">logP</abbr>
+  * <abbr title="Logprobs">logP</abbr>
-     - <abbr title="Prompt Logprobs">prmpt logP</abbr>
+  * <abbr title="Prompt Logprobs">prmpt logP</abbr>
-     - <abbr title="Async Output Processing">async output</abbr>
+  * <abbr title="Async Output Processing">async output</abbr>
-     - multi-step
+  * multi-step
-     - <abbr title="Multimodal Inputs">mm</abbr>
+  * <abbr title="Multimodal Inputs">mm</abbr>
-     - best-of
+  * best-of
-     - beam-search
+  * beam-search
-     - <abbr title="Guided Decoding">guided dec</abbr>
+  * <abbr title="Guided Decoding">guided dec</abbr>
-   * - [CP](#chunked-prefill)
+- * [CP](#chunked-prefill)
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-   * - [APC](#automatic-prefix-caching)
+- * [APC](#automatic-prefix-caching)
-     - ✅
+  * ✅
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-   * - [LoRA](#lora-adapter)
+- * [LoRA](#lora-adapter)
-     - [✗](gh-pr:9057)
+  * [✗](gh-pr:9057)
-     - ✅
+  * ✅
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-   * - <abbr title="Prompt Adapter">prmpt adptr</abbr>
+- * <abbr title="Prompt Adapter">prmpt adptr</abbr>
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-   * - [SD](#spec_decode)
+- * [SD](#spec_decode)
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✗
+  * ✗
-     - ✅
+  * ✅
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-   * - CUDA graph
+- * CUDA graph
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-   * - <abbr title="Pooling Models">pooling</abbr>
+- * <abbr title="Pooling Models">pooling</abbr>
-     - ✗
+  * ✗
-     - ✗
+  * ✗
-     - ✗
+  * ✗
-     - ✗
+  * ✗
-     - ✗
+  * ✗
-     - ✗
+  * ✗
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-   * - <abbr title="Encoder-Decoder Models">enc-dec</abbr>
+- * <abbr title="Encoder-Decoder Models">enc-dec</abbr>
-     - ✗
+  * ✗
-     - [✗](gh-issue:7366)
+  * [✗](gh-issue:7366)
-     - ✗
+  * ✗
-     - ✗
+  * ✗
-     - [✗](gh-issue:7366)
+  * [✗](gh-issue:7366)
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-   * - <abbr title="Logprobs">logP</abbr>
+- * <abbr title="Logprobs">logP</abbr>
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✗
+  * ✗
-     - ✅
+  * ✅
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-   * - <abbr title="Prompt Logprobs">prmpt logP</abbr>
+- * <abbr title="Prompt Logprobs">prmpt logP</abbr>
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - [✗](gh-pr:8199)
+  * [✗](gh-pr:8199)
-     - ✅
+  * ✅
-     - ✗
+  * ✗
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-   * - <abbr title="Async Output Processing">async output</abbr>
+- * <abbr title="Async Output Processing">async output</abbr>
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✗
+  * ✗
-     - ✅
+  * ✅
-     - ✗
+  * ✗
-     - ✗
+  * ✗
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-   * - multi-step
+- * multi-step
-     - ✗
+  * ✗
-     - ✅
+  * ✅
-     - ✗
+  * ✗
-     - ✅
+  * ✅
-     - ✗
+  * ✗
-     - ✅
+  * ✅
-     - ✗
+  * ✗
-     - ✗
+  * ✗
-     - ✅
+  * ✅
-     - [✗](gh-issue:8198)
+  * [✗](gh-issue:8198)
-     - ✅
+  * ✅
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-   * - <abbr title="Multimodal Inputs">mm</abbr>
+- * <abbr title="Multimodal Inputs">mm</abbr>
-     - ✅
+  * ✅
-     -  [✗](gh-pr:8348)
+  * [✗](gh-pr:8348)
-     -  [✗](gh-pr:7199)
+  * [✗](gh-pr:7199)
-     - ?
+  * ?
-     - ?
+  * ?
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ?
+  * ?
-     -
+  *
-     -
+  *
-     -
+  *
-     -
+  *
-   * - best-of
+- * best-of
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - [✗](gh-issue:6137)
+  * [✗](gh-issue:6137)
-     - ✅
+  * ✅
-     - ✗
+  * ✗
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ?
+  * ?
-     - [✗](gh-issue:7968)
+  * [✗](gh-issue:7968)
-     - ✅
+  * ✅
-     -
+  *
-     -
+  *
-     -
+  *
-   * - beam-search
+- * beam-search
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - [✗](gh-issue:6137)
+  * [✗](gh-issue:6137)
-     - ✅
+  * ✅
-     - ✗
+  * ✗
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ?
+  * ?
-     - [✗](gh-issue:7968)
+  * [✗](gh-issue:7968)
-     - ?
+  * ?
-     - ✅
+  * ✅
-     -
+  *
-     -
+  *
-   * - <abbr title="Guided Decoding">guided dec</abbr>
+- * <abbr title="Guided Decoding">guided dec</abbr>
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ?
+  * ?
-     - ?
+  * ?
-     - [✗](gh-issue:11484)
+  * [✗](gh-issue:11484)
-     - ✅
+  * ✅
-     - ✗
+  * ✗
-     - ?
+  * ?
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - [✗](gh-issue:9893)
+  * [✗](gh-issue:9893)
-     - ?
+  * ?
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     -
+  *
+:::
-```
 (feature-x-hardware)=
 ## Feature x Hardware
-```{list-table}
+:::{list-table}
-   :header-rows: 1
+:header-rows: 1
-   :stub-columns: 1
+:stub-columns: 1
-   :widths: auto
+:widths: auto
-   * - Feature
+- * Feature
-     - Volta
+  * Volta
-     - Turing
+  * Turing
-     - Ampere
+  * Ampere
-     - Ada
+  * Ada
-     - Hopper
+  * Hopper
-     - CPU
+  * CPU
-     - AMD
+  * AMD
-   * - [CP](#chunked-prefill)
+- * [CP](#chunked-prefill)
-     - [✗](gh-issue:2729)
+  * [✗](gh-issue:2729)
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-   * - [APC](#automatic-prefix-caching)
+- * [APC](#automatic-prefix-caching)
-     - [✗](gh-issue:3687)
+  * [✗](gh-issue:3687)
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-   * - [LoRA](#lora-adapter)
+- * [LoRA](#lora-adapter)
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-   * - <abbr title="Prompt Adapter">prmpt adptr</abbr>
+- * <abbr title="Prompt Adapter">prmpt adptr</abbr>
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - [✗](gh-issue:8475)
+  * [✗](gh-issue:8475)
-     - ✅
+  * ✅
-   * - [SD](#spec_decode)
+- * [SD](#spec_decode)
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-   * - CUDA graph
+- * CUDA graph
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✗
+  * ✗
-     - ✅
+  * ✅
-   * - <abbr title="Pooling Models">pooling</abbr>
+- * <abbr title="Pooling Models">pooling</abbr>
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ?
+  * ?
-   * - <abbr title="Encoder-Decoder Models">enc-dec</abbr>
+- * <abbr title="Encoder-Decoder Models">enc-dec</abbr>
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✗
+  * ✗
-   * - <abbr title="Multimodal Inputs">mm</abbr>
+- * <abbr title="Multimodal Inputs">mm</abbr>
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-   * - <abbr title="Logprobs">logP</abbr>
+- * <abbr title="Logprobs">logP</abbr>
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-   * - <abbr title="Prompt Logprobs">prmpt logP</abbr>
+- * <abbr title="Prompt Logprobs">prmpt logP</abbr>
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-   * - <abbr title="Async Output Processing">async output</abbr>
+- * <abbr title="Async Output Processing">async output</abbr>
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✗
+  * ✗
-     - ✗
+  * ✗
-   * - multi-step
+- * multi-step
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - [✗](gh-issue:8477)
+  * [✗](gh-issue:8477)
-     - ✅
+  * ✅
-   * - best-of
+- * best-of
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-   * - beam-search
+- * beam-search
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-   * - <abbr title="Guided Decoding">guided dec</abbr>
+- * <abbr title="Guided Decoding">guided dec</abbr>
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-     - ✅
+  * ✅
-```
+:::
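Besides the fences, the `list-table` hunks above also swap the row and cell markers (`* -` becomes `- *`, continuation `-` becomes `*`) and drop the three-space indent inside the directive. A rough sketch of that rewrite, assuming the body is passed without its fences; `swap_list_table_markers` is a hypothetical helper, not something taken from this commit:

```python
import re
import textwrap


def swap_list_table_markers(body: str) -> str:
    """Dedent a list-table body and swap its list markers.

    `body` is the text between the directive fences: option lines
    followed by rows written as `* - cell` / `  - cell`.
    """
    out = []
    for line in textwrap.dedent(body).split("\n"):
        # First cell of a row: "* - cell" -> "- * cell".
        line = re.sub(r"^\* -(?=\s|$)\s*", "- * ", line)
        # Later cells of the row: "  - cell" -> "  * cell".
        line = re.sub(r"^  -(?=\s|$)\s*", "  * ", line)
        out.append(line.rstrip())  # empty cells become a bare marker
    return "\n".join(out)
```

Option lines such as `:header-rows: 1` match neither pattern, so they are only dedented.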
--- a/docs/source/features/disagg_prefill.md
+++ b/docs/source/features/disagg_prefill.md
@@ -4,9 +4,9 @@
 This page introduces you to the disaggregated prefilling feature in vLLM.
-```{note}
+:::{note}
 This feature is experimental and subject to change.
-```
+:::
 ## Why disaggregated prefilling?
@@ -15,9 +15,9 @@ Two main reasons:
 - **Tuning time-to-first-token (TTFT) and inter-token-latency (ITL) separately**. Disaggregated prefilling puts the prefill and decode phases of LLM inference inside different vLLM instances. This gives you the flexibility to assign different parallel strategies (e.g. `tp` and `pp`) to tune TTFT without affecting ITL, or to tune ITL without affecting TTFT.
 - **Controlling tail ITL**. Without disaggregated prefilling, vLLM may insert some prefill jobs during the decoding of one request. This results in higher tail latency. Disaggregated prefilling helps you solve this issue and control tail ITL. Chunked prefill with a proper chunk size can also achieve the same goal, but in practice it's hard to figure out the correct chunk size value. So disaggregated prefilling is a much more reliable way to control tail ITL.
-```{note}
+:::{note}
 Disaggregated prefill DOES NOT improve throughput.
-```
+:::
 ## Usage example
@@ -39,21 +39,21 @@ Key abstractions for disaggregated prefilling:
 - **LookupBuffer**: LookupBuffer provides two APIs: `insert` KV cache and `drop_select` KV cache. The semantics of `insert` and `drop_select` are similar to SQL, where `insert` inserts a KV cache into the buffer, and `drop_select` returns the KV cache that matches the given condition and drops it from the buffer.
 - **Pipe**: A single-direction FIFO pipe for tensor transmission. It supports `send_tensor` and `recv_tensor`.
-```{note}
+:::{note}
 `insert` is a non-blocking operation, but `drop_select` is a blocking operation.
-```
+:::
 Here is a figure illustrating how the above 3 abstractions are organized:
-```{image} /assets/features/disagg_prefill/abstraction.jpg
+:::{image} /assets/features/disagg_prefill/abstraction.jpg
 :alt: Disaggregated prefilling abstractions
-```
+:::
 The workflow of disaggregated prefilling is as follows:
-```{image} /assets/features/disagg_prefill/overview.jpg
+:::{image} /assets/features/disagg_prefill/overview.jpg
 :alt: Disaggregated prefilling workflow
-```
+:::
 The `buffer` corresponds to the `insert` API in LookupBuffer, and the `drop_select` corresponds to the `drop_select` API in LookupBuffer.

--- a/docs/source/features/lora.md
+++ b/docs/source/features/lora.md
@@ -60,9 +60,9 @@ vllm serve meta-llama/Llama-2-7b-hf \
    --lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/
 ```
-```{note}
+:::{note}
 The commit ID `0dfa347e8877a4d4ed19ee56c140fa518470028c` may change over time. Please check the latest commit ID in your environment to ensure you are using the correct one.
-```
+:::
 The server entrypoint accepts all other LoRA configuration parameters (`max_loras`, `max_lora_rank`, `max_cpu_loras`,
 etc.), which will apply to all forthcoming requests. Upon querying the `/models` endpoint, we should see our LoRA along

--- a/docs/source/features/quantization/auto_awq.md
+++ b/docs/source/features/quantization/auto_awq.md
@@ -2,11 +2,11 @@
 # AutoAWQ
-```{warning}
+:::{warning}
 Please note that AWQ support in vLLM is under-optimized at the moment. We would recommend using the unquantized version of the model for better
 accuracy and higher throughput. Currently, you can use AWQ as a way to reduce memory footprint. As of now, it is more suitable for low latency
 inference with a small number of concurrent requests. vLLM's AWQ implementation has lower throughput than the unquantized version.
-```
+:::
 To create a new 4-bit quantized model, you can leverage [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
 Quantizing reduces the model's precision from FP16 to INT4 which effectively reduces the file size by ~70%.

--- a/docs/source/features/quantization/fp8.md
+++ b/docs/source/features/quantization/fp8.md
@@ -14,10 +14,10 @@ The FP8 types typically supported in hardware have two distinct representations,
 - **E4M3**: Consists of 1 sign bit, 4 exponent bits, and 3 bits of mantissa. It can store values up to +/-448 and `nan`.
 - **E5M2**: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- `inf`, and `nan`. The tradeoff for the increased dynamic range is lower precision of the stored values.
-```{note}
+:::{note}
 FP8 computation is supported on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper).
 FP8 models will run on compute capability >= 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin.
-```
+:::
 ## Quick Start with Online Dynamic Quantization
@@ -32,9 +32,9 @@ model = LLM("facebook/opt-125m", quantization="fp8")
 result = model.generate("Hello, my name is")
 ```
-```{warning}
+:::{warning}
 Currently, we load the model at original precision before quantizing down to 8-bits, so you need enough memory to load the whole model.
-```
+:::
 ## Installation
@@ -110,9 +110,9 @@ model.generate("Hello my name is")
 Evaluate accuracy with `lm_eval` (for example on 250 samples of `gsm8k`):
-```{note}
+:::{note}
 Quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations.
-```
+:::
 ```console
 $ MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic
@@ -137,10 +137,10 @@ If you encounter any issues or have feature requests, please open an issue on th
 ## Deprecated Flow
-```{note}
+:::{note}
 The following information is preserved for reference and search purposes.
 The quantization method described below is deprecated in favor of the `llmcompressor` method described above.
-```
+:::
 For static per-tensor offline quantization to FP8, please install the [AutoFP8 library](https://github.com/neuralmagic/autofp8).

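The E4M3 and E5M2 limits quoted in the `fp8.md` context above (+/-448 and +/-57344) can be reproduced from the bit layouts. The sketch below assumes the commonly used FP8 encodings: E5M2 reserves its top exponent code for `inf` and `nan` as IEEE formats do, while E4M3 reserves only the single all-ones pattern for `nan`. Those encoding details are an assumption brought in for illustration and are not stated in the diff; `max_finite` is a hypothetical helper.

```python
def max_finite(exp_bits: int, man_bits: int, *, ieee_like: bool) -> float:
    """Largest finite magnitude of a small floating-point format."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # Top exponent code is reserved for inf/nan (assumed for E5M2).
        max_exp_field = 2 ** exp_bits - 2
        max_mantissa = 2 ** man_bits - 1
    else:
        # Only exponent=all-ones with mantissa=all-ones is nan (assumed for E4M3).
        max_exp_field = 2 ** exp_bits - 1
        max_mantissa = 2 ** man_bits - 2
    return 2.0 ** (max_exp_field - bias) * (1 + max_mantissa / 2 ** man_bits)


e4m3_max = max_finite(4, 3, ieee_like=False)  # 2**8 * 1.75
e5m2_max = max_finite(5, 2, ieee_like=True)   # 2**15 * 1.75
```

The extra exponent bit in E5M2 buys range at the cost of one mantissa bit, which is the precision tradeoff the note describes.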
--- a/docs/source/features/quantization/gguf.md
+++ b/docs/source/features/quantization/gguf.md
@@ -2,13 +2,13 @@
 # GGUF
-```{warning}
+:::{warning}
 Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment, and it might be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team.
-```
+:::
-```{warning}
+:::{warning}
 Currently, vLLM only supports loading single-file GGUF models. If you have a multi-file GGUF model, you can use the [gguf-split](https://github.com/ggerganov/llama.cpp/pull/6135) tool to merge it into a single-file model.
-```
+:::
 To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command:
@@ -25,9 +25,9 @@ You can also add `--tensor-parallel-size 2` to enable tensor parallelism inferen
 vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2
 ```
-```{warning}
+:::{warning}
 We recommend using the tokenizer from the base model instead of the GGUF model, because the tokenizer conversion from GGUF is time-consuming and unstable, especially for models with a large vocabulary size.
-```
+:::
 You can also use the GGUF model directly through the LLM entrypoint:

--- a/docs/source/features/quantization/index.md
+++ b/docs/source/features/quantization/index.md
@@ -4,7 +4,7 @@
 Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices.
-```{toctree}
+:::{toctree}
 :caption: Contents
 :maxdepth: 1
@@ -15,4 +15,4 @@ gguf
 int8
 fp8
 quantized_kvcache
-```
+:::
--- a/docs/source/features/quantization/int8.md
+++ b/docs/source/features/quantization/int8.md
@@ -7,9 +7,9 @@ This quantization method is particularly useful for reducing model size while ma
 Please visit the HF collection of [quantized INT8 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/int8-llms-for-vllm-668ec32c049dca0369816415).
-```{note}
+:::{note}
 INT8 computation is supported on NVIDIA GPUs with compute capability >= 7.5 (Turing, Ampere, Ada Lovelace, Hopper).
-```
+:::
 ## Prerequisites
@@ -119,9 +119,9 @@ $ lm_eval --model vllm \
  --batch_size 'auto'
 ```
-```{note}
+:::{note}
 Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations.
-```
+:::
 ## Best Practices

--- a/docs/source/features/quantization/supported_hardware.md
+++ b/docs/source/features/quantization/supported_hardware.md
@@ -4,128 +4,129 @@
 The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
-```{list-table}
+:::{list-table}
 :header-rows: 1
 :widths: 20 8 8 8 8 8 8 8 8 8 8
-* - Implementation
+- * Implementation
-  - Volta
+  * Volta
-  - Turing
+  * Turing
-  - Ampere
+  * Ampere
-  - Ada
+  * Ada
-  - Hopper
+  * Hopper
-  - AMD GPU
+  * AMD GPU
-  - Intel GPU
+  * Intel GPU
-  - x86 CPU
+  * x86 CPU
-  - AWS Inferentia
+  * AWS Inferentia
-  - Google TPU
+  * Google TPU
-* - AWQ
+- * AWQ
-  - ✗
+  * ✗
-  - ✅︎
+  * ✅︎
-  - ✅︎
+  * ✅︎
-  - ✅︎
+  * ✅︎
-  - ✅︎
+  * ✅︎
-  - ✗
+  * ✗
-  - ✅︎
+  * ✅︎
-  - ✅︎
+  * ✅︎
-  - ✗
+  * ✗
-  - ✗
+  * ✗
-* - GPTQ
+- * GPTQ
-  - ✅︎
+  * ✅︎
-  - ✅︎
+  * ✅︎
-  - ✅︎
+  * ✅︎
-  - ✅︎
+  * ✅︎
-  - ✅︎
+  * ✅︎
-  - ✗
+  * ✗
-  - ✅︎
+  * ✅︎
-  - ✅︎
+  * ✅︎
-  - ✗
+  * ✗
-  - ✗
+  * ✗
-* - Marlin (GPTQ/AWQ/FP8)
+- * Marlin (GPTQ/AWQ/FP8)
-  - ✗
+  * ✗
-  - ✗
+  * ✗
-  - ✅︎
+  * ✅︎
-  - ✅︎
+  * ✅︎
-  - ✅︎
+  * ✅︎
-  - ✗
+  * ✗
-  - ✗
+  * ✗
-  - ✗
+  * ✗
-  - ✗
+  * ✗
-  - ✗
+  * ✗
-* - INT8 (W8A8)
+- * INT8 (W8A8)
-  - ✗
+  * ✗
-  - ✅︎
+  * ✅︎
-  - ✅︎
+  * ✅︎
-  - ✅︎
+  * ✅︎
-  - ✅︎
+  * ✅︎
-  - ✗
+  * ✗
-  - ✗
+  * ✗
-  - ✅︎
+  * ✅︎
-  - ✗
+  * ✗
-  - ✗
+  * ✗
-* - FP8 (W8A8)
+- * FP8 (W8A8)
-  - ✗
+  * ✗
-  - ✗
+  * ✗
-  - ✗
+  * ✗
-  - ✅︎
+  * ✅︎
-  - ✅︎
+  * ✅︎
-  - ✅︎
+  * ✅︎
-  - ✗
+  * ✗
-  - ✗
+  * ✗
-  - ✗
+  * ✗
-  - ✗
+  * ✗
-* - AQLM
+- * AQLM
-  - ✅︎
+  * ✅︎
-  - ✅︎
+  * ✅︎
-  - ✅︎
+  * ✅︎
-  - ✅︎
+  * ✅︎
-  - ✅︎
+  * ✅︎
-  - ✗
+  * ✗
-  - ✗
+  * ✗
-  - ✗
+  * ✗
-  - ✗
+  * ✗
-  - ✗
+  * ✗
-* - bitsandbytes
+- * bitsandbytes
-  - ✅︎
+  * ✅︎
-  - ✅︎
+  * ✅︎
-  - ✅︎
+  * ✅︎
-  - ✅︎
+  * ✅︎
-  - ✅︎
+  * ✅︎
-  - ✗
+  * ✗
-  - ✗
+  * ✗
-  - ✗
+  * ✗
-  - ✗
+  * ✗
-  - ✗
+  * ✗
-* - DeepSpeedFP
+- * DeepSpeedFP
-  - ✅︎
+  * ✅︎
-  - ✅︎
+  * ✅︎
-  - ✅︎
+  * ✅︎
-  - ✅︎
+  * ✅︎
-  - ✅︎
+  * ✅︎
-  - ✗
+  * ✗
-  - ✗
+  * ✗
-  - ✗
+  * ✗
-  - ✗
+  * ✗
-  - ✗
+  * ✗
-* - GGUF
+- * GGUF
-  - ✅︎
+  * ✅︎
-  - ✅︎
+  * ✅︎
-  - ✅︎
+  * ✅︎
-  - ✅︎
+  * ✅︎
-  - ✅︎
+  * ✅︎
-  - ✅︎
+  * ✅︎
-  - ✗
+  * ✗
-  - ✗
+  * ✗
-  - ✗
+  * ✗
-  - ✗
+  * ✗
-```
+:::
 - Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
 - "✅︎" indicates that the quantization method is supported on the specified hardware.
 - "✗" indicates that the quantization method is not supported on the specified hardware.
-```{note}
+:::{note}
 This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
 For the most up-to-date information on hardware support and quantization methods, please refer to <gh-dir:vllm/model_executor/layers/quantization> or consult with the vLLM development team.
-```
+:::
--- a/docs/source/features/spec_decode.md
+++ b/docs/source/features/spec_decode.md
@@ -2,15 +2,15 @@
 # Speculative Decoding
-```{warning}
+:::{warning}
 Please note that speculative decoding in vLLM is not yet optimized and does
 not usually yield inter-token latency reductions for all prompt datasets or sampling parameters.
 The work to optimize it is ongoing and can be followed here: <gh-issue:4630>
-```
+:::
-```{warning}
+:::{warning}
 Currently, speculative decoding in vLLM is not compatible with pipeline parallelism.
-```
+:::
 This document shows how to use [Speculative Decoding](https://x.com/karpathy/status/1697318534555336961) with vLLM.
 Speculative decoding is a technique which improves inter-token latency in memory-bound LLM inference.

--- a/docs/source/features/structured_outputs.md
+++ b/docs/source/features/structured_outputs.md
@@ -95,10 +95,10 @@ completion = client.chat.completions.create(
 print(completion.choices[0].message.content)
 ```
-```{tip}
+:::{tip}
 While not strictly necessary, it's normally better to state in the prompt that JSON needs to be generated, which fields it should contain, and how the LLM should fill them.
 This can improve the results notably in most cases.
-```
+:::
 Finally, we have `guided_grammar`, which is probably the most difficult option to use but is really powerful, as it allows us to define complete languages like SQL queries.
 It works by using a context-free EBNF grammar, which for example we can use to define a specific format of simplified SQL queries, like in the example below:

--- a/docs/source/generate_examples.py
+++ b/docs/source/generate_examples.py
@@ -57,9 +57,9 @@ class Index:
    def generate(self) -> str:
        content = f"# {self.title}\n\n{self.description}\n\n"
-        content += "```{toctree}\n"
+        content += ":::{toctree}\n"
        content += f":caption: {self.caption}\n:maxdepth: {self.maxdepth}\n"
-        content += "\n".join(self.documents) + "\n```\n"
+        content += "\n".join(self.documents) + "\n:::\n"
        return content

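To see what the changed `generate` method in the hunk above now emits, here is a self-contained sketch. The method body is copied from the `+` side of the hunk; the dataclass wrapper, field types, and defaults are assumptions made for illustration, since the diff only shows that these attributes exist.

```python
from dataclasses import dataclass, field


@dataclass
class Index:
    # Field types and defaults are assumed, not taken from the diff.
    title: str
    description: str
    caption: str
    maxdepth: int = 1
    documents: list[str] = field(default_factory=list)

    def generate(self) -> str:
        content = f"# {self.title}\n\n{self.description}\n\n"
        content += ":::{toctree}\n"  # colon fence opens the directive
        content += f":caption: {self.caption}\n:maxdepth: {self.maxdepth}\n"
        content += "\n".join(self.documents) + "\n:::\n"  # and closes it
        return content


page = Index("Examples", "Example scripts.", "Examples", 2, ["a", "b"]).generate()
```

Generated index pages therefore use the same fence style as the hand-written docs.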
--- a/docs/source/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md
+++ b/docs/source/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md
@@ -86,9 +86,9 @@ docker build -f Dockerfile.hpu -t vllm-hpu-env  .
 docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
 ```
-```{tip}
+:::{tip}
 If you're observing the following error: `docker: Error response from daemon: Unknown runtime specified habana.`, please refer to "Install Using Containers" section of [Intel Gaudi Software Stack and Driver Installation](https://docs.habana.ai/en/v1.18.0/Installation_Guide/Bare_Metal_Fresh_OS.html). Make sure you have `habana-container-runtime` package installed and that `habana` container runtime is registered.
-```
+:::
 ## Extra information
@@ -155,30 +155,30 @@ Gaudi2 devices. Configurations that are not listed may or may not work.
 Currently in vLLM for HPU we support four execution modes, depending on selected HPU PyTorch Bridge backend (via `PT_HPU_LAZY_MODE` environment variable), and `--enforce-eager` flag.
-```{list-table} vLLM execution modes
+:::{list-table} vLLM execution modes
 :widths: 25 25 50
 :header-rows: 1
-* - `PT_HPU_LAZY_MODE`
+- * `PT_HPU_LAZY_MODE`
-  - `enforce_eager`
+  * `enforce_eager`
-  - execution mode
+  * execution mode
-* - 0
+- * 0
-  - 0
+  * 0
-  - torch.compile
+  * torch.compile
-* - 0
+- * 0
-  - 1
+  * 1
-  - PyTorch eager mode
+  * PyTorch eager mode
-* - 1
+- * 1
-  - 0
+  * 0
-  - HPU Graphs
+  * HPU Graphs
-* - 1
+- * 1
-  - 1
+  * 1
-  - PyTorch lazy mode
+  * PyTorch lazy mode
-```
+:::
-```{warning}
+:::{warning}
 In 1.18.0, all modes utilizing `PT_HPU_LAZY_MODE=0` are highly experimental and should be only used for validating functional correctness. Their performance will be improved in the next releases. For obtaining the best performance in 1.18.0, please use HPU Graphs, or PyTorch lazy mode.
-```
+:::
 (gaudi-bucketing-mechanism)=
@@ -187,9 +187,9 @@ In 1.18.0, all modes utilizing `PT_HPU_LAZY_MODE=0` are highly experimental and
 Intel Gaudi accelerators work best when operating on models with fixed tensor shapes. [Intel Gaudi Graph Compiler](https://docs.habana.ai/en/latest/Gaudi_Overview/Intel_Gaudi_Software_Suite.html#graph-compiler-and-runtime) is responsible for generating optimized binary code that implements the given model topology on Gaudi. In its default configuration, the produced binary code may be heavily dependent on input and output tensor shapes, and can require graph recompilation when encountering differently shaped tensors within the same topology. While the resulting binaries utilize Gaudi efficiently, the compilation itself may introduce a noticeable overhead in end-to-end execution.
 In a dynamic inference serving scenario, there is a need to minimize the number of graph compilations and reduce the risk of graph compilation occurring during server runtime. Currently it is achieved by "bucketing" model's forward pass across two dimensions - `batch_size` and `sequence_length`.
-```{note}
+:::{note}
 Bucketing allows us to reduce the number of required graphs significantly, but it does not handle any graph compilation and device code generation - this is done in warmup and HPUGraph capture phase.
-```
+:::
 Bucketing ranges are determined with 3 parameters - `min`, `step` and `max`. They can be set separately for prompt and decode phase, and for batch size and sequence length dimension. These parameters can be observed in logs during vLLM startup:
@@ -222,15 +222,15 @@ min = 128, step = 128, max = 512
 In the logged scenario, 24 buckets were generated for prompt (prefill) runs, and 48 buckets for decode runs. Each bucket corresponds to a separate optimized device binary for a given model with specified tensor shapes. Whenever a batch of requests is processed, it is padded across batch and sequence length dimension to the smallest possible bucket.
-```{warning}
+:::{warning}
 If a request exceeds maximum bucket size in any dimension, it will be processed without padding, and its processing may require a graph compilation, potentially significantly increasing end-to-end latency. The boundaries of the buckets are user-configurable via environment variables, and upper bucket boundaries can be increased to avoid such scenario.
-```
+:::
 As an example, if a request of 3 sequences with a max sequence length of 412 comes in to an idle vLLM server, it will be padded and executed as a `(4, 512)` prefill bucket, as `batch_size` (number of sequences) will be padded to 4 (closest batch_size dimension higher than 3), and max sequence length will be padded to 512 (closest sequence length dimension higher than 412). After the prefill stage, it will be executed as a `(4, 512)` decode bucket and will continue as that bucket until either the batch dimension changes (due to a request being finished), in which case it will become a `(2, 512)` bucket, or the context length increases above 512 tokens, in which case it will become a `(4, 640)` bucket.
-```{note}
+:::{note}
 Bucketing is transparent to a client -- padding in sequence length dimension is never returned to the client, and padding in batch dimension does not create new requests.
-```
+:::
 ### Warmup
@@ -252,9 +252,9 @@ INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size
 This example uses the same buckets as in the [Bucketing Mechanism](#gaudi-bucketing-mechanism) section. Each output line corresponds to execution of a single bucket. When bucket is executed for the first time, its graph is compiled and can be reused later on, skipping further graph compilations.
-```{tip}
+:::{tip}
 Compiling all the buckets might take some time and can be turned off with `VLLM_SKIP_WARMUP=true` environment variable. Keep in mind that if you do that, you may face graph compilations once executing a given bucket for the first time. It is fine to disable warmup for development, but it's highly recommended to enable it in deployment.
-```
+:::
 ### HPU Graph capture
@@ -269,9 +269,9 @@ With its default value (`VLLM_GRAPH_RESERVED_MEM=0.1`), 10% of usable memory wil
 Environment variable `VLLM_GRAPH_PROMPT_RATIO` determines the ratio of usable graph memory reserved for prefill and decode graphs. By default (`VLLM_GRAPH_PROMPT_RATIO=0.3`), both stages have equal memory constraints.
 Lower value corresponds to less usable graph memory reserved for prefill stage, e.g. `VLLM_GRAPH_PROMPT_RATIO=0.2` will reserve 20% of usable graph memory for prefill graphs, and 80% of usable graph memory for decode graphs.
-```{note}
+:::{note}
 `gpu_memory_utilization` does not correspond to the absolute memory usage across HPU. It specifies the memory margin after loading the model and performing a profile run. If device has 100 GiB of total memory, and 50 GiB of free memory after loading model weights and executing profiling run, `gpu_memory_utilization` at its default value will mark 90% of 50 GiB as usable, leaving 5 GiB of margin, regardless of total device memory.
-```
+:::
 User can also configure the strategy for capturing HPU Graphs for prompt and decode stages separately. Strategy affects the order of capturing graphs. There are two strategies implemented:
 \- `max_bs` - graph capture queue will be sorted in descending order by batch size. Buckets with equal batch sizes are sorted by sequence length in ascending order (e.g. `(64, 128)`, `(64, 256)`, `(32, 128)`, `(32, 256)`, `(1, 128)`, `(1, 256)`), default strategy for decode
@@ -279,9 +279,9 @@ User can also configure the strategy for capturing HPU Graphs for prompt and dec
 When there's a large number of requests pending, vLLM scheduler will attempt to fill the maximum batch size for decode as soon as possible. When a request is finished, decode batch size decreases. When that happens, vLLM will attempt to schedule a prefill iteration for requests in the waiting queue, to fill the decode batch size to its previous state. This means that in a full load scenario, decode batch size is often at its maximum, which makes large batch size HPU Graphs crucial to capture, as reflected by `max_bs` strategy. On the other hand, prefills will be executed most frequently with very low batch sizes (1-4), which is reflected in `min_tokens` strategy.
-```{note}
+:::{note}
 `VLLM_GRAPH_PROMPT_RATIO` does not set a hard limit on memory taken by graphs for each stage (prefill and decode). vLLM will first attempt to use up the entirety of usable prefill graph memory (usable graph memory * `VLLM_GRAPH_PROMPT_RATIO`) for capturing prefill HPU Graphs, next it will attempt to do the same for decode graphs and the usable decode graph memory pool. If one stage is fully captured, and there is unused memory left within the usable graph memory pool, vLLM will attempt further graph capture for the other stage, until no more HPU Graphs can be captured without exceeding the reserved memory pool. The behavior of that mechanism can be observed in the example below.
-```
+:::
 Each described step is logged by vLLM server, as follows (negative values correspond to memory being released):
@@ -352,13 +352,13 @@ INFO 08-02 17:38:43 hpu_executor.py:91] init_cache_engine took 37.92 GiB of devi
 - `VLLM_{phase}_{dim}_BUCKET_{param}` - collection of 12 environment variables configuring ranges of bucketing mechanism
-  - `{phase}` is either `PROMPT` or `DECODE`
+  * `{phase}` is either `PROMPT` or `DECODE`
-  - `{dim}` is either `BS`, `SEQ` or `BLOCK`
+  * `{dim}` is either `BS`, `SEQ` or `BLOCK`
-  - `{param}` is either `MIN`, `STEP` or `MAX`
+  * `{param}` is either `MIN`, `STEP` or `MAX`
-  - Default values:
+  * Default values:
    - Prompt:
      - batch size min (`VLLM_PROMPT_BS_BUCKET_MIN`): `1`

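The padding rule described in the bucketing hunks of `hpu-gaudi.inc.md` above can be sketched in a few lines. This is a simplified illustration, not vLLM's implementation: it assumes evenly spaced boundaries from `min` to `max`, the helper names `linear_buckets` and `pad_to_bucket` are hypothetical, and the batch-size buckets are invented for the example. Only the sequence-length parameters (`min = 128, step = 128, max = 512`) and the `(3, 412)` request come from the text.

```python
from typing import Optional


def linear_buckets(min_value: int, step: int, max_value: int) -> list[int]:
    """Evenly spaced bucket boundaries from min to max (simplifying assumption)."""
    return list(range(min_value, max_value + 1, step))


def pad_to_bucket(value: int, buckets: list[int]) -> Optional[int]:
    """Return the smallest bucket >= value, or None if value exceeds every bucket."""
    candidates = [b for b in buckets if b >= value]
    return min(candidates) if candidates else None


# Sequence-length parameters as shown in the hunk header of the diff.
seq_buckets = linear_buckets(128, 128, 512)
# Hypothetical batch-size buckets, chosen only to reproduce the example.
bs_buckets = [1, 2, 4, 8]

# A request of 3 sequences with max sequence length 412.
padded_shape = (pad_to_bucket(3, bs_buckets), pad_to_bucket(412, seq_buckets))
print(padded_shape)
```

A `None` result stands for the case covered by the warning in the diff: a request larger than the biggest bucket is processed without padding and may trigger a graph compilation.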
--- a/docs/source/getting_started/installation/ai_accelerator/index.md
+++ b/docs/source/getting_started/installation/ai_accelerator/index.md
@@ -2,374 +2,374 @@
 vLLM is a Python library that supports the following AI accelerators. Select your AI accelerator type to see vendor specific instructions:
-::::{tab-set}
+:::::{tab-set}
 :sync-group: device
-:::{tab-item} TPU
+::::{tab-item} TPU
 :sync: tpu
-```{include} tpu.inc.md
+:::{include} tpu.inc.md
 :start-after: "# Installation"
 :end-before: "## Requirements"
-```
 :::
-:::{tab-item} Intel Gaudi
+::::
+::::{tab-item} Intel Gaudi
 :sync: hpu-gaudi
-```{include} hpu-gaudi.inc.md
+:::{include} hpu-gaudi.inc.md
 :start-after: "# Installation"
 :end-before: "## Requirements"
-```
 :::
-:::{tab-item} Neuron
+::::
+::::{tab-item} Neuron
 :sync: neuron
-```{include} neuron.inc.md
+:::{include} neuron.inc.md
 :start-after: "# Installation"
 :end-before: "## Requirements"
-```
 :::
-:::{tab-item} OpenVINO
+::::
+::::{tab-item} OpenVINO
 :sync: openvino
-```{include} openvino.inc.md
+:::{include} openvino.inc.md
 :start-after: "# Installation"
 :end-before: "## Requirements"
-```
 :::
 ::::
+:::::
 ## Requirements
-::::{tab-set}
+:::::{tab-set}
 :sync-group: device
-:::{tab-item} TPU
+::::{tab-item} TPU
 :sync: tpu
-```{include} tpu.inc.md
+:::{include} tpu.inc.md
 :start-after: "## Requirements"
 :end-before: "## Configure a new environment"
-```
 :::
-:::{tab-item} Intel Gaudi
+::::
+::::{tab-item} Intel Gaudi
 :sync: hpu-gaudi
-```{include} hpu-gaudi.inc.md
+:::{include} hpu-gaudi.inc.md
 :start-after: "## Requirements"
 :end-before: "## Configure a new environment"
-```
 :::
-:::{tab-item} Neuron
+::::
+::::{tab-item} Neuron
 :sync: neuron
-```{include} neuron.inc.md
+:::{include} neuron.inc.md
 :start-after: "## Requirements"
 :end-before: "## Configure a new environment"
-```
 :::
-:::{tab-item} OpenVINO
+::::
+::::{tab-item} OpenVINO
 :sync: openvino
-```{include} openvino.inc.md
+:::{include} openvino.inc.md
 :start-after: "## Requirements"
 :end-before: "## Set up using Python"
-```
 :::
 ::::
+:::::
 ## Configure a new environment
-::::{tab-set}
+:::::{tab-set}
 :sync-group: device
-:::{tab-item} TPU
+::::{tab-item} TPU
 :sync: tpu
-```{include} tpu.inc.md
+:::{include} tpu.inc.md
 :start-after: "## Configure a new environment"
 :end-before: "## Set up using Python"
-```
 :::
-:::{tab-item} Intel Gaudi
+::::
+::::{tab-item} Intel Gaudi
 :sync: hpu-gaudi
-```{include} hpu-gaudi.inc.md
+:::{include} hpu-gaudi.inc.md
 :start-after: "## Configure a new environment"
 :end-before: "## Set up using Python"
-```
 :::
-:::{tab-item} Neuron
+::::
+::::{tab-item} Neuron
 :sync: neuron
-```{include} neuron.inc.md
+:::{include} neuron.inc.md
 :start-after: "## Configure a new environment"
 :end-before: "## Set up using Python"
-```
 :::
-:::{tab-item} OpenVINO
+::::
-:sync: openvino
-```{include} ../python_env_setup.inc.md
+::::{tab-item} OpenVINO
-```
+:sync: openvino
+:::{include} ../python_env_setup.inc.md
 :::
 ::::
+:::::
 ## Set up using Python
 ### Pre-built wheels
-::::{tab-set}
+:::::{tab-set}
 :sync-group: device
-:::{tab-item} TPU
+::::{tab-item} TPU
 :sync: tpu
-```{include} tpu.inc.md
+:::{include} tpu.inc.md
 :start-after: "### Pre-built wheels"
 :end-before: "### Build wheel from source"
-```
 :::
-:::{tab-item} Intel Gaudi
+::::
+::::{tab-item} Intel Gaudi
 :sync: hpu-gaudi
-```{include} hpu-gaudi.inc.md
+:::{include} hpu-gaudi.inc.md
 :start-after: "### Pre-built wheels"
 :end-before: "### Build wheel from source"
-```
 :::
-:::{tab-item} Neuron
+::::
+::::{tab-item} Neuron
 :sync: neuron
-```{include} neuron.inc.md
+:::{include} neuron.inc.md
 :start-after: "### Pre-built wheels"
 :end-before: "### Build wheel from source"
-```
 :::
-:::{tab-item} OpenVINO
+::::
+::::{tab-item} OpenVINO
 :sync: openvino
-```{include} openvino.inc.md
+:::{include} openvino.inc.md
 :start-after: "### Pre-built wheels"
 :end-before: "### Build wheel from source"
-```
 :::
 ::::
+:::::
 ### Build wheel from source
-::::{tab-set}
+:::::{tab-set}
 :sync-group: device
-:::{tab-item} TPU
+::::{tab-item} TPU
 :sync: tpu
-```{include} tpu.inc.md
+:::{include} tpu.inc.md
 :start-after: "### Build wheel from source"
 :end-before: "## Set up using Docker"
-```
 :::
-:::{tab-item} Intel Gaudi
+::::
+::::{tab-item} Intel Gaudi
 :sync: hpu-gaudi
-```{include} hpu-gaudi.inc.md
+:::{include} hpu-gaudi.inc.md
 :start-after: "### Build wheel from source"
 :end-before: "## Set up using Docker"
-```
 :::
-:::{tab-item} Neuron
+::::
+::::{tab-item} Neuron
 :sync: neuron
-```{include} neuron.inc.md
+:::{include} neuron.inc.md
 :start-after: "### Build wheel from source"
 :end-before: "## Set up using Docker"
-```
 :::
-:::{tab-item} OpenVINO
+::::
+::::{tab-item} OpenVINO
 :sync: openvino
-```{include} openvino.inc.md
+:::{include} openvino.inc.md
 :start-after: "### Build wheel from source"
 :end-before: "## Set up using Docker"
-```
 :::
 ::::
+:::::
 ## Set up using Docker
 ### Pre-built images
-::::{tab-set}
+:::::{tab-set}
 :sync-group: device
-:::{tab-item} TPU
+::::{tab-item} TPU
 :sync: tpu
-```{include} tpu.inc.md
+:::{include} tpu.inc.md
 :start-after: "### Pre-built images"
 :end-before: "### Build image from source"
-```
 :::
-:::{tab-item} Intel Gaudi
+::::
+::::{tab-item} Intel Gaudi
 :sync: hpu-gaudi
-```{include} hpu-gaudi.inc.md
+:::{include} hpu-gaudi.inc.md
 :start-after: "### Pre-built images"
 :end-before: "### Build image from source"
-```
 :::
-:::{tab-item} Neuron
+::::
+::::{tab-item} Neuron
 :sync: neuron
-```{include} neuron.inc.md
+:::{include} neuron.inc.md
 :start-after: "### Pre-built images"
 :end-before: "### Build image from source"
-```
 :::
-:::{tab-item} OpenVINO
+::::
+::::{tab-item} OpenVINO
 :sync: openvino
-```{include} openvino.inc.md
+:::{include} openvino.inc.md
 :start-after: "### Pre-built images"
 :end-before: "### Build image from source"
-```
 :::
 ::::
+:::::
 ### Build image from source
-::::{tab-set}
+:::::{tab-set}
 :sync-group: device
-:::{tab-item} TPU
+::::{tab-item} TPU
 :sync: tpu
-```{include} tpu.inc.md
+:::{include} tpu.inc.md
 :start-after: "### Build image from source"
 :end-before: "## Extra information"
-```
 :::
-:::{tab-item} Intel Gaudi
+::::
+::::{tab-item} Intel Gaudi
 :sync: hpu-gaudi
-```{include} hpu-gaudi.inc.md
+:::{include} hpu-gaudi.inc.md
 :start-after: "### Build image from source"
 :end-before: "## Extra information"
-```
 :::
-:::{tab-item} Neuron
+::::
+::::{tab-item} Neuron
 :sync: neuron
-```{include} neuron.inc.md
+:::{include} neuron.inc.md
 :start-after: "### Build image from source"
 :end-before: "## Extra information"
-```
 :::
-:::{tab-item} OpenVINO
+::::
+::::{tab-item} OpenVINO
 :sync: openvino
-```{include} openvino.inc.md
+:::{include} openvino.inc.md
 :start-after: "### Build image from source"
 :end-before: "## Extra information"
-```
 :::
 ::::
+:::::
 ## Extra information
-::::{tab-set}
+:::::{tab-set}
 :sync-group: device
-:::{tab-item} TPU
+::::{tab-item} TPU
 :sync: tpu
-```{include} tpu.inc.md
+:::{include} tpu.inc.md
 :start-after: "## Extra information"
-```
 :::
-:::{tab-item} Intel Gaudi
+::::
+::::{tab-item} Intel Gaudi
 :sync: hpu-gaudi
-```{include} hpu-gaudi.inc.md
+:::{include} hpu-gaudi.inc.md
 :start-after: "## Extra information"
-```
 :::
-:::{tab-item} Neuron
+::::
+::::{tab-item} Neuron
 :sync: neuron
-```{include} neuron.inc.md
+:::{include} neuron.inc.md
 :start-after: "## Extra information"
-```
 :::
-:::{tab-item} OpenVINO
+::::
+::::{tab-item} OpenVINO
 :sync: openvino
-```{include} openvino.inc.md
+:::{include} openvino.inc.md
 :start-after: "## Extra information"
-```
 :::
 ::::
+:::::
--- a/docs/source/getting_started/installation/ai_accelerator/neuron.inc.md
+++ b/docs/source/getting_started/installation/ai_accelerator/neuron.inc.md
@@ -67,9 +67,9 @@ Currently, there are no pre-built Neuron wheels.
 ### Build wheel from source
-```{note}
+:::{note}
 The currently supported version of Pytorch for Neuron installs `triton` version `2.1.0`. This is incompatible with `vllm >= 0.5.3`. You may see an error `cannot import name 'default_dump_dir...`. To work around this, run a `pip install --upgrade triton==3.0.0` after installing the vLLM wheel.
-```
+:::
 Following instructions are applicable to Neuron SDK 2.16 and beyond.